Bigger versus Similar: Selecting a background corpus for first story detection based on distributional similarity

Fei Wang, Robert J. Ross, John D. Kelleher

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

1 Citation (Scopus)

Abstract

The current state of the art for First Story Detection (FSD) consists of nearest neighbour-based models with traditional term vector representations. One challenge faced by these models is that the document representation is usually defined by the vocabulary and term frequencies of a background corpus. Consequently, the ideal background corpus should arguably be both large enough to ensure adequate term coverage and similar to the target domain in its language distribution. Because these two requirements cannot always be satisfied together, in this paper we examine whether the distributional similarity of common terms is more important than the scale of common terms for FSD. As a basis for our analysis, we propose a set of metrics to quantitatively measure the scale of common terms and the distributional similarity between corpora. Using these metrics, we rank different background corpora relative to a target corpus. We also apply models based on different background corpora to the FSD task. Our results show that term distributional similarity is more predictive of good FSD performance than the scale of common terms; thus, we demonstrate that a smaller, recent, domain-related corpus will be more suitable than a very large-scale general corpus for FSD.
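As a rough illustration of the two quantities the abstract contrasts, the sketch below counts the terms shared by a background corpus and a target corpus (a stand-in for "scale of common terms") and computes the Jensen-Shannon divergence between their relative frequencies over those shared terms (a stand-in for "distributional similarity", where lower means more similar). These are assumptions chosen for illustration only: the abstract does not specify the paper's metrics, and the function names, whitespace tokenisation, and choice of divergence are all hypothetical.

```python
from collections import Counter
import math


def term_counts(docs):
    """Count lower-cased, whitespace-separated terms (assumed tokenisation)."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return counts


def common_term_scale(background, target):
    """Number of distinct terms that occur in both corpora."""
    return len(set(background) & set(target))


def js_divergence(background, target):
    """Jensen-Shannon divergence (base 2) over the common terms.

    Illustrative choice, not the paper's metric. Ranges from 0
    (identical distributions) to 1; returns 1.0 if no terms are shared.
    """
    common = set(background) & set(target)
    if not common:
        return 1.0
    b_total = sum(background[t] for t in common)
    t_total = sum(target[t] for t in common)
    divergence = 0.0
    for term in common:
        p = background[term] / b_total  # relative frequency in background
        q = target[term] / t_total      # relative frequency in target
        m = (p + q) / 2
        divergence += 0.5 * p * math.log2(p / m) + 0.5 * q * math.log2(q / m)
    return divergence


if __name__ == "__main__":
    target = term_counts(["quake hits city", "city council vote"])
    big_general = term_counts(["city life", "vote vote vote", "weather report"])
    small_recent = term_counts(["quake hits city", "council vote"])
    for name, corpus in [("big_general", big_general), ("small_recent", small_recent)]:
        print(name, common_term_scale(corpus, target), js_divergence(corpus, target))
```

Under these assumptions, candidate background corpora could be ranked by either number, and the two rankings compared against the FSD performance obtained with each corpus.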

Original language: English
Title of host publication: International Conference on Recent Advances in Natural Language Processing in a Deep Learning World, RANLP 2019 - Proceedings
Editors: Galia Angelova, Ruslan Mitkov, Ivelina Nikolova, Irina Temnikova
Publisher: Incoma Ltd
Pages: 1312-1320
Number of pages: 9
ISBN (Electronic): 9789544520557
Publication status: Published - 2019
Event: 12th International Conference on Recent Advances in Natural Language Processing, RANLP 2019 - Varna, Bulgaria
Duration: 2 Sep 2019 to 4 Sep 2019

Publication series

Name: International Conference Recent Advances in Natural Language Processing, RANLP
Volume: 2019-September
ISSN (Print): 1313-8502

Conference

Conference: 12th International Conference on Recent Advances in Natural Language Processing, RANLP 2019
Country/Territory: Bulgaria
City: Varna
Period: 2/09/19 to 4/09/19
