TY - JOUR
T1 - Size matters
T2 - The impact of training size in taxonomically-enriched word embeddings
AU - Maldonado, Alfredo
AU - Klubička, Filip
AU - Kelleher, John
N1 - Publisher Copyright:
© 2019 A. Maldonado et al.
PY - 2019/1/1
Y1 - 2019/1/1
AB - Word embeddings trained on natural corpora (e.g., newspaper collections, Wikipedia or the Web) excel in capturing thematic similarity ("topical relatedness") on word pairs such as 'coffee' and 'cup' or 'bus' and 'road'. However, they are less successful on pairs showing taxonomic similarity, like 'cup' and 'mug' (near synonyms) or 'bus' and 'train' (types of public transport). Moreover, purely taxonomy-based embeddings (e.g. those trained on a random-walk of WordNet's structure) outperform natural-corpus embeddings in taxonomic similarity but underperform them in thematic similarity. Previous work suggests that performance gains in both types of similarity can be achieved by enriching natural-corpus embeddings with taxonomic information from taxonomies like WordNet, typically by combining natural-corpus embeddings with such taxonomic embeddings. This paper conducts a deep analysis of this assumption and shows that both the size of the natural corpus and the random-walk's coverage of the WordNet structure play a crucial role in the performance of the combined (enriched) vectors on both similarity tasks. Specifically, we show that embeddings trained on medium-sized natural corpora benefit the most from taxonomic enrichment, whilst embeddings trained on large natural corpora only benefit from this enrichment when evaluated on taxonomic similarity tasks. The implication is that care must be taken to control the size of the natural corpus and the size of the random-walk used to train the vectors. In addition, we find that, whilst the WordNet structure is finite and can be fully traversed in a single pass, the repetition of well-connected WordNet concepts in extended random-walks effectively reinforces taxonomic relations in the learned embeddings.
KW - Retrofitting
KW - Semantic similarity
KW - Taxonomic embeddings
KW - Taxonomic enrichment
KW - Word embeddings
KW - WordNet
UR - https://www.scopus.com/pages/publications/85074506258
U2 - 10.1515/comp-2019-0009
DO - 10.1515/comp-2019-0009
M3 - Article
SN - 2299-1093
VL - 9
SP - 252
EP - 267
JO - Open Computer Science
JF - Open Computer Science
IS - 1
ER -