Synthetic, Yet Natural: Properties of WordNet Random Walk Corpora and the impact of rare words on embedding performance

Filip Klubicka, Alfredo Maldonado, Abhijit Mahalunkar, John Kelleher

Research output: Contribution to conference › Paper › peer-review

Abstract

Creating word embeddings that reflect semantic relationships encoded in lexical knowledge resources is an open challenge. One approach is to use a random walk over a knowledge graph to generate a pseudo-corpus and use this corpus to train embeddings. However, the effect of the shape of the knowledge graph on the generated pseudo-corpora, and on the resulting word embeddings, has not been studied. To explore this, we use English WordNet, constrained to the taxonomic (tree-like) portion of the graph, as a case study. We investigate the properties of the generated pseudo-corpora, and their impact on the resulting embeddings. We find that the distributions in the pseudo-corpora exhibit properties found in natural corpora, such as Zipf’s and Heaps’ laws, and also observe that the proportion of rare words in a pseudo-corpus affects the performance of its embeddings on word similarity.
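
The random-walk idea described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy taxonomy, the function names, and the stopping rule (stop with a fixed probability before each further step) are hypothetical and are not taken from the paper, which uses English WordNet and may define its walks differently. The two helper functions compute the quantities usually inspected for Zipf's law (frequency by rank) and Heaps' law (vocabulary size as the corpus grows).

```python
import random
from collections import Counter

# Hypothetical toy taxonomy (child -> parent); a stand-in for a WordNet-like tree.
TOY_TAXONOMY = {
    "animal": "entity",
    "plant": "entity",
    "dog": "animal",
    "cat": "animal",
    "puppy": "dog",
    "tree": "plant",
    "oak": "tree",
}


def build_neighbours(child_to_parent):
    """Turn child -> parent links into an undirected adjacency map."""
    neighbours = {}
    for child, parent in child_to_parent.items():
        neighbours.setdefault(child, []).append(parent)
        neighbours.setdefault(parent, []).append(child)
    return neighbours


def random_walk_corpus(neighbours, n_sentences, stop_prob, rng):
    """Generate pseudo-sentences; each one is the node sequence of a random walk."""
    nodes = sorted(neighbours)
    corpus = []
    for _ in range(n_sentences):
        node = rng.choice(nodes)
        sentence = [node]
        # Assumed stopping rule: before each further step, stop with
        # probability stop_prob; otherwise move to a random neighbour.
        while rng.random() > stop_prob:
            node = rng.choice(neighbours[node])
            sentence.append(node)
        corpus.append(sentence)
    return corpus


def rank_frequencies(corpus):
    """Token frequencies sorted from most to least frequent (Zipf-style view)."""
    counts = Counter(token for sentence in corpus for token in sentence)
    return [freq for _, freq in counts.most_common()]


def vocabulary_growth(corpus):
    """Vocabulary size after each token of the corpus (Heaps-style view)."""
    seen = set()
    growth = []
    for sentence in corpus:
        for token in sentence:
            seen.add(token)
            growth.append(len(seen))
    return growth


if __name__ == "__main__":
    graph = build_neighbours(TOY_TAXONOMY)
    corpus = random_walk_corpus(graph, n_sentences=200, stop_prob=0.3,
                                rng=random.Random(0))
    print(rank_frequencies(corpus))
    print(vocabulary_growth(corpus)[-1])
```

On a graph this small the output says nothing about natural-language statistics; with a real taxonomy, the rank-frequency list and the vocabulary growth curve could be plotted on log-log axes to check for the approximately linear trends associated with Zipf's and Heaps' laws.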
Original language: English
Publication status: Published - 2019
Event: GWC2019: 10th Global WordNet Conference - Wroclaw, Poland
Duration: 23 Jul 2019 to 27 Jul 2019

Conference

Conference: GWC2019: 10th Global WordNet Conference
Country/Territory: Poland
City: Wroclaw
Period: 23/07/19 to 27/07/19

Keywords

  • word embeddings
  • semantic relationships
  • lexical knowledge resources
  • random walk
  • knowledge graph
  • pseudo-corpus
  • WordNet
  • taxonomic graph
  • Zipf’s law
  • Heaps’ law
  • rare words
  • word similarity
