Working paper

Effort versus performance tradeoff in lemmatisation for Uralic languages

Proceedings of the Sixth International Workshop on Computational Linguistics of Uralic Languages. 2020.iwclul-1.2. Association for Computational Linguistics, 2020
Tyers F. M., Bibaeva M.
Lemmatisers for Uralic languages are required for dictionary lookup, an important task for language learners. We explore how to decide whether a rule-based or an unsupervised lemmatiser is the more efficient investment. We compare rule-based lemmatisers derived from the Giellatekno finite-state morphology project with unsupervised lemmatisers derived from the Morfessor surface segmenter trained on Wikipedia. The comparison spans six Uralic languages, from relatively high-resource (Finnish) to extremely low-resource (Uralic languages of Russia). Performance is measured on dictionary lookup and vocabulary reduction tasks over the Wikipedia corpora. Linguistic input is quantified as the quantity of source code and the state machine complexity for the rule-based systems, and as the size of the training corpus for the unsupervised systems; both are normalised against Finnish. For most languages, performance improves with linguistic input. Future work will produce quantitative estimates of the relationship between corpus size, ruleset size, and lemmatisation performance.
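
The abstract names two evaluation tasks without defining them. The following is a minimal sketch of how such measures could be computed, not the authors' implementation. The function names, the exact definitions (vocabulary reduction as one minus the ratio of unique lemmas to unique word forms, dictionary lookup as the share of tokens whose lemma is a dictionary headword), and the toy Finnish data are all assumptions made for illustration.

```python
from typing import Callable, Iterable, Set


def vocabulary_reduction(tokens: Iterable[str],
                         lemmatise: Callable[[str], str]) -> float:
    """Assumed definition: 1 - (unique lemmas / unique word forms)."""
    forms = set(tokens)
    if not forms:
        return 0.0
    lemmas = {lemmatise(form) for form in forms}
    return 1.0 - len(lemmas) / len(forms)


def dictionary_lookup_rate(tokens: Iterable[str],
                           lemmatise: Callable[[str], str],
                           dictionary: Set[str]) -> float:
    """Assumed definition: share of tokens whose lemma is a headword."""
    token_list = list(tokens)
    if not token_list:
        return 0.0
    hits = sum(1 for tok in token_list if lemmatise(tok) in dictionary)
    return hits / len(token_list)


# Hypothetical stand-in for a real lemmatiser: a tiny lookup table that
# returns the word form unchanged when it has no entry.
toy_table = {"talossa": "talo", "talon": "talo", "kissat": "kissa"}


def toy_lemmatise(form: str) -> str:
    return toy_table.get(form, form)


tokens = ["talossa", "talon", "talo", "kissat", "kissa", "koirille"]
headwords = {"talo", "kissa", "koira"}

reduction = vocabulary_reduction(tokens, toy_lemmatise)
lookup = dictionary_lookup_rate(tokens, toy_lemmatise, headwords)
print(f"vocabulary reduction: {reduction:.2f}")
print(f"dictionary lookup rate: {lookup:.2f}")
```

In this toy run, six word forms collapse to three lemmas, and the one form the table cannot analyse ("koirille") is the only token that misses the dictionary. Either a finite-state analyser or a segmenter-based lemmatiser could be passed in place of `toy_lemmatise`, which is what would make the two approaches comparable under the same measures.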