Working paper

Some challenges of the West Circassian polysynthetic corpus

Linguistics. WP BRP. NRU HSE, 2015. No. 37/LNG/2015.
Arkhangelskiy T., Lander Yu.
Although there exist comprehensive morphologically annotated corpora for many morphologically rich languages, there have been no such corpora for any polysynthetic language so far. Polysynthetic languages raise a variety of theoretical and practical challenges for corpus linguistics. Some of these challenges have been partly addressed in the development of corpora for, e.g., Turkic or Uralic languages, while others are unique to languages of this kind. Our paper identifies the most prominent challenges that we are facing in developing the West Circassian (Adyghe) corpus and offers possible solutions. These include the tokenization problem, which involves delimiting morphology from syntax, problems with lemmatization and part-of-speech tagging, and a number of glossing and search problems.