Update Frequency and Background Corpus Selection in Dynamic TF-IDF Models for First Story Detection

Fei Wang, Robert J. Ross, John D. Kelleher

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

First Story Detection (FSD) requires a system to detect the very first story that mentions an event from a stream of stories. Nearest neighbour-based models, using traditional term vector document representations such as TF-IDF, currently achieve the state of the art in FSD. Because of its online nature, a dynamic term vector model that is incrementally updated during the detection process is usually adopted for FSD instead of a static model. However, very little research has investigated the selection of hyper-parameters and background corpora for a dynamic model. In this paper, we analyse how a dynamic term vector model works for FSD, and investigate the impact of different update frequencies and background corpora on FSD performance. Our results show that dynamic models with high update frequencies outperform static models and dynamic models with low update frequencies, and that the FSD performance of dynamic models does not always increase with higher update frequencies, but instead reaches a steady state once some update frequency threshold is passed. In addition, we demonstrate that different background corpora have very limited influence on the FSD performance of dynamic models with high update frequencies.
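To make the terms in the abstract concrete, the following is a minimal Python sketch of a nearest-neighbour FSD loop over a dynamic TF-IDF model. It is an illustration under stated assumptions, not the authors' implementation: the function names, the smoothed IDF formula, the `update_every` parameter (standing in for the update frequency, with 0 meaning a static model), and the choice to re-weight earlier stories with the current document-frequency table are all hypothetical simplifications.

```python
import math
from collections import Counter


def tfidf(tokens, df, n_docs):
    """TF-IDF vector for one story under the current document-frequency table."""
    tf = Counter(tokens)
    # Smoothed IDF (an assumption) so unseen terms still get a finite weight.
    return {t: c * math.log((n_docs + 1) / (df.get(t, 0) + 0.5))
            for t, c in tf.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def novelty_scores(stream, background=(), update_every=1):
    """Return one novelty score per story (1 minus nearest-neighbour similarity).

    stream       -- iterable of token lists, in arrival order
    background   -- token lists used to initialise the document frequencies
    update_every -- fold new stories into the IDF table after this many
                    arrivals; 0 keeps the model static (hypothetical parameter)
    """
    df, n_docs = Counter(), 0
    for doc in background:
        df.update(set(doc))
        n_docs += 1

    seen, pending, scores = [], [], []
    for tokens in stream:
        vec = tfidf(tokens, df, n_docs)
        # Simplification: earlier stories are re-weighted with the current table.
        nearest = max((cosine(vec, tfidf(old, df, n_docs)) for old in seen),
                      default=0.0)
        scores.append(1.0 - nearest)
        seen.append(tokens)
        pending.append(tokens)
        if update_every and len(pending) >= update_every:
            for doc in pending:
                df.update(set(doc))
                n_docs += 1
            pending.clear()
    return scores


if __name__ == "__main__":
    stream = [["quake", "hits", "city"],
              ["quake", "hits", "city"],
              ["election", "results"]]
    print(novelty_scores(stream, update_every=1))
```

In a sketch like this, a story would be flagged as a first story when its novelty score exceeds a chosen threshold; the threshold, the tokenisation, and the background corpus are left to the caller.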

Original language: English
Title of host publication: Computational Linguistics - 16th International Conference of the Pacific Association for Computational Linguistics, PACLING 2019, Revised Selected Papers
Editors: Le-Minh Nguyen, Satoshi Tojo, Xuan-Hieu Phan, Kôiti Hasida
Publisher: Springer
Pages: 206-217
Number of pages: 12
ISBN (Print): 9789811561672
Publication status: Published - 2020
Event: 16th International Conference of the Pacific Association for Computational Linguistics, PACLING 2019 - Hanoi, Viet Nam
Duration: 11 Oct 2019 to 13 Oct 2019

Publication series

Name: Communications in Computer and Information Science
Volume: 1215 CCIS
ISSN (Print): 1865-0929
ISSN (Electronic): 1865-0937

Conference

Conference: 16th International Conference of the Pacific Association for Computational Linguistics, PACLING 2019
Country/Territory: Viet Nam
City: Hanoi
Period: 11/10/19 to 13/10/19

Keywords

  • Background corpus
  • First Story Detection
  • Nearest neighbour
  • Novelty detection
  • TF-IDF
  • Update frequency
