Abstract
Language embeddings are often used as black-box, word-level tools that provide powerful language analysis across many tasks. For tasks such as authorship attribution, however, access to feature-level information on character n-grams can provide insights that help with model refinement and development. In this paper we investigate and evaluate the importance of character n-grams within an embedding context in authorship attribution through the use of attention scores. We perform this investigation on both English (Reuters-50-50) and Russian (Taiga) news authorship datasets. Our analysis shows that attention scores are higher for character n-grams that humans consider important for authorship identification. Beyond the specific benefits for authorship attribution, this work provides insights into the importance of character n-grams as a unit within embeddings.
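As a rough illustration of the idea in the abstract, the sketch below extracts overlapping character n-grams, turns raw relevance scores into attention weights with a softmax, and ranks n-grams by their mean weight. It is not the model from the paper: the function names (`char_ngrams`, `softmax`, `rank_ngrams_by_attention`) are hypothetical, and the raw scores are assumed to come from some trained attention layer that is not shown here.

```python
import math
from collections import defaultdict


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of `text`, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def softmax(scores):
    """Normalise raw scores into attention weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rank_ngrams_by_attention(ngrams, weights):
    """Average the weight of each distinct n-gram and sort descending."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for gram, weight in zip(ngrams, weights):
        totals[gram] += weight
        counts[gram] += 1
    means = {gram: totals[gram] / counts[gram] for gram in totals}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    grams = char_ngrams("Well, well.", n=3)
    # Placeholder scores (an assumption for illustration only): n-grams
    # containing punctuation get a higher raw score. A real model would
    # produce these scores from learned parameters.
    raw = [1.0 if any(ch in ",." for ch in g) else 0.0 for g in grams]
    for gram, weight in rank_ngrams_by_attention(grams, softmax(raw)):
        print(repr(gram), round(weight, 3))
```

Averaging per distinct n-gram, rather than summing, keeps frequent n-grams from dominating the ranking simply because they occur more often; which aggregation the authors used is not stated in the abstract.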
| Original language | English |
|---|---|
| Pages | 939-941 |
| Number of pages | 3 |
| Publication status | Published - 27 Mar 2023 |
| Event | 38th Annual ACM Symposium on Applied Computing, SAC 2023 - Tallinn, Estonia; Duration: 27 Mar 2023 → 31 Mar 2023 |
Conference
| Conference | 38th Annual ACM Symposium on Applied Computing, SAC 2023 |
|---|---|
| Country/Territory | Estonia |
| City | Tallinn |
| Period | 27/03/23 → 31/03/23 |
Keywords
- attention score
- authorship attribution task
- character n-grams