hr500k – A Reference Training Corpus of Croatian.

Nikola Ljubešić, Željko Agić, Filip Klubička, Vuk Batanović, Tomaž Erjavec

Research output: Contribution to conference, Paper, peer-review

Abstract

In this paper we present hr500k, a Croatian reference training corpus of 500 thousand tokens, segmented at document, sentence and word level, and annotated for morphosyntax, lemmas, dependency syntax, named entities, and semantic roles. We present each annotation layer via basic label statistics and describe the final encoding of the resource in CoNLL and TEI formats. We also give a description of the rather turbulent history of the resource and give insights into the topic and genre distribution in the corpus. Finally, we discuss further enrichments of the corpus with additional layers, which are already underway.
Original language: English
Publication status: Published - 2018
Externally published: Yes
Event: Language Technologies and Digital Humanities Conference - Ljubljana, Slovenia
Duration: 20 Sep 2018 to 21 Sep 2018

Conference

Conference: Language Technologies and Digital Humanities Conference
Country/Territory: Slovenia
City: Ljubljana
Period: 20/09/18 to 21/09/18

Keywords

  • Croatian reference training corpus
  • morphosyntax
  • lemmas
  • dependency syntax
  • named entities
  • semantic roles
  • CoNLL
  • TEI formats
  • topic distribution
  • genre distribution
