Abstract
Creating word embeddings that reflect semantic relationships encoded in lexical knowledge resources is an open challenge. One approach is to use a random walk over a knowledge graph to generate a pseudo-corpus and then train embeddings on this corpus. However, the effect of the shape of the knowledge graph on the generated pseudo-corpora, and on the resulting word embeddings, has not been studied. To explore this, we use English WordNet, constrained to the taxonomic (tree-like) portion of the graph, as a case study. We investigate the properties of the generated pseudo-corpora and their impact on the resulting embeddings. We find that the distributions in the pseudo-corpora exhibit properties found in natural corpora, such as Zipf’s and Heaps’ laws, and also observe that the proportion of rare words in a pseudo-corpus affects the performance of its embeddings on word similarity.
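The random-walk approach described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy taxonomy `HYPERNYM`, the stopping probability `stop_prob`, the uniform choice of start node, the undirected treatment of taxonomic edges, and the rare-word `threshold` are hypothetical and are not taken from the paper, which may configure its walks differently.

```python
import random
from collections import Counter

# Hypothetical toy taxonomy (child -> parent); NOT extracted from WordNet.
HYPERNYM = {
    "dog": "canine", "wolf": "canine", "canine": "mammal",
    "cat": "feline", "feline": "mammal", "mammal": "animal",
    "sparrow": "bird", "bird": "animal",
}

def build_neighbours(child_to_parent):
    """Turn child -> parent links into an undirected adjacency map (assumption)."""
    neighbours = {}
    for child, parent in child_to_parent.items():
        neighbours.setdefault(child, []).append(parent)
        neighbours.setdefault(parent, []).append(child)
    return neighbours

def random_walk(neighbours, start, rng, stop_prob=0.15, max_len=20):
    """Walk from `start`, stopping with probability `stop_prob` after each step."""
    sentence = [start]
    node = start
    while len(sentence) < max_len and rng.random() > stop_prob:
        node = rng.choice(neighbours[node])
        sentence.append(node)
    return sentence

def generate_pseudo_corpus(neighbours, n_sentences, seed=0):
    """Each walk yields one pseudo-sentence; start nodes are chosen uniformly."""
    rng = random.Random(seed)
    nodes = sorted(neighbours)
    return [random_walk(neighbours, rng.choice(nodes), rng)
            for _ in range(n_sentences)]

def rare_word_proportion(corpus, threshold=5):
    """Share of vocabulary items occurring fewer than `threshold` times."""
    counts = Counter(tok for sent in corpus for tok in sent)
    rare = sum(1 for c in counts.values() if c < threshold)
    return rare / len(counts)

neighbours = build_neighbours(HYPERNYM)
corpus = generate_pseudo_corpus(neighbours, n_sentences=1000)
print(rare_word_proportion(corpus))
```

Pseudo-sentences produced this way could then be passed to any standard embedding trainer. Sorting the token counts by rank, or tracking vocabulary size against corpus size, would be one way to inspect the Zipf-like and Heaps-like behaviour the abstract reports.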
| Original language | English |
|---|---|
| Publication status | Published - 2019 |
| Event | GWC2019: 10th Global WordNet Conference, Wroclaw, Poland. Duration: 23 Jul 2019 → 27 Jul 2019 |
Conference
| Conference | GWC2019: 10th Global WordNet Conference |
|---|---|
| Country/Territory | Poland |
| City | Wroclaw |
| Period | 23/07/19 → 27/07/19 |
Keywords
- word embeddings
- semantic relationships
- lexical knowledge resources
- random walk
- knowledge graph
- pseudo-corpus
- WordNet
- taxonomic graph
- Zipf’s law
- Heaps’ law
- rare words
- word similarity