Abstract
Crosslingual document classification aims to classify documents written in different languages that share a common genre, topic or author. Knowledge-based methods and methods based on machine translation deliver state-of-the-art classification accuracy; however, because they rely on external resources, poorly resourced languages remain a challenge for these types of methods. In this paper, we propose a novel set of language-independent features that capture language use in a document at a deep level, using only information intrinsic to the document. These features are based on vocabulary richness measurements and are both text-length independent and self-contained, meaning that no external resources such as lexicons or machine translation software are needed. Preliminary evaluation shows promising results for the task of crosslingual authorship attribution, outperforming similar methods.
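To make the idea of self-contained vocabulary richness features concrete, the following is a minimal sketch and not the feature set proposed in the paper, which the abstract does not specify. It assumes three classical measures chosen here purely for illustration: type-token ratio, hapax legomena ratio, and Yule's K. The function names `tokenize` and `vocabulary_richness` are hypothetical. Yule's K is commonly regarded as roughly insensitive to text length, whereas the type-token ratio is known to vary with it. The regex tokenizer is also an assumption and would not segment scripts that do not separate words with spaces.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: words are runs of Unicode word characters, lowercased.
    return re.findall(r"\w+", text.lower())


def vocabulary_richness(text):
    """Compute illustrative vocabulary richness measures from the text alone."""
    tokens = tokenize(text)
    n = len(tokens)  # N: number of tokens
    if n == 0:
        raise ValueError("document contains no tokens")

    freqs = Counter(tokens)  # word -> frequency
    v = len(freqs)           # V: number of distinct word types
    hapax = sum(1 for c in freqs.values() if c == 1)  # types occurring once

    # Frequency spectrum: V_i = number of types that occur exactly i times.
    spectrum = Counter(freqs.values())
    s2 = sum(i * i * v_i for i, v_i in spectrum.items())

    # Yule's K = 10^4 * (sum_i i^2 * V_i - N) / N^2
    yule_k = 1e4 * (s2 - n) / (n * n)

    return {
        "ttr": v / n,              # type-token ratio (length dependent)
        "hapax_ratio": hapax / v,  # share of types seen only once
        "yule_k": yule_k,
    }
```

Because the measures use only counts taken from the document itself, the same function can be applied unchanged to documents in different languages, and the resulting values could serve as inputs to any standard classifier.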
| Original language | English |
|---|---|
| Pages (from-to) | 16-25 |
| Number of pages | 10 |
| Journal | CEUR Workshop Proceedings |
| Volume | 1589 |
| Publication status | Published - 2016 |
| Event | 1st Workshop on Modeling, Learning and Mining for Cross/Multilinguality, MultiLingMine 2016 - Padova, Italy. Duration: 20 Mar 2016 → … |
Keywords
- Crosslingual authorship attribution
- Crosslingual document classification
- Deep level lexical features
- Vocabulary richness features