Deep level lexical features for cross-lingual authorship attribution

Research output: Contribution to journal › Conference article › peer-review

Abstract

Crosslingual document classification aims to classify documents written in different languages that share a common genre, topic or author. Knowledge-based methods and methods based on machine translation deliver state-of-the-art classification accuracy; however, because they rely on external resources, poorly resourced languages remain a challenge for these types of methods. In this paper, we propose a novel set of language-independent features that capture language use in a document at a deep level, using only features that are intrinsic to the document. These features are based on vocabulary richness measurements and are text-length independent and self-contained, meaning that no external resources such as lexicons or machine translation software are needed. Preliminary evaluation shows promising results for the task of crosslingual authorship attribution, outperforming similar methods.
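The abstract does not name the specific vocabulary richness measurements used. As a purely illustrative sketch, and not the authors' feature set, the Python below computes Yule's K, a classic vocabulary richness measure that is commonly described as relatively stable across text lengths. It needs only the document's own tokens. The function names `tokenize` and `yules_k` and the simple regex tokenizer are assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into runs of Unicode word characters.

    Assumption for this sketch: a plain regex tokenizer, so no lexicon or
    other language-specific resource is required.
    """
    return re.findall(r"\w+", text.lower())


def yules_k(tokens):
    """Yule's K = 10^4 * (sum of squared type frequencies - N) / N^2.

    N is the total number of tokens. Higher values indicate more
    repetition, i.e. a less varied vocabulary.
    """
    n = len(tokens)
    if n == 0:
        return 0.0
    type_freqs = Counter(tokens)
    sum_sq = sum(f * f for f in type_freqs.values())
    return 1e4 * (sum_sq - n) / (n * n)


if __name__ == "__main__":
    # Hypothetical sample sentences in two languages, for illustration only.
    for sample in ("the cat saw the dog and the bird",
                   "el gato vio el perro y el pájaro"):
        print(round(yules_k(tokenize(sample)), 2))
```

Because a measure like this depends only on how often word types repeat within a document, the same computation applies unchanged to texts in different languages, which is the property the abstract calls self-contained.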

Original language: English
Pages (from-to): 16-25
Number of pages: 10
Journal: CEUR Workshop Proceedings
Volume: 1589
Publication status: Published - 2016
Event: 1st Workshop on Modeling, Learning and Mining for Cross/Multilinguality, MultiLingMine 2016 - Padova, Italy
Duration: 20 Mar 2016 → …

Keywords

  • Crosslingual authorship attribution
  • Crosslingual document classification
  • Deep level lexical features
  • Vocabulary richness features
