Towards automatic lexico-grammatical tagging of the Middle Russian corpus of the 15th-17th centuries
This paper discusses two approaches to the automatic lexico-grammatical tagging of Middle Russian texts (1400–1700) included in the Russian National Corpus (RNC). The task is to assign each token a part-of-speech label, a tuple of grammatical features, and a lemma, without disambiguation. Middle Russian combines features of an earlier state of the grammatical system (aorist and imperfect verb forms, the dual number, and a number of archaic inflectional paradigms) with features of Modern Russian inflectional morphology. The lexicon shows the same mix of Old Russian and Modern Russian lemmas. Moreover, the texts may contain Church Slavonic and dialectal forms. The absence of a standardised orthography and of a standard variant of the language makes processing Middle Russian texts even more challenging. The first approach is based on compiling an electronic dictionary of Old Russian and building a module to handle spelling inconsistency. In the absence of open electronic resources for Middle Russian morphology, an electronic dictionary of Church Slavonic was expanded and adapted to Middle Russian. The paper describes the steps required to change the nominal and verbal entries in this dictionary. We follow the principle of "a wider expansion", under which the analyser is allowed to generate as many annotations as possible so that at least one of them is correct. The second approach combines an existing Modern Russian tagger, supplemented by a module that reduces spelling variation, with a database of lexico-grammatical annotations retrieved from the Diachronic corpus of the RNC. We evaluate the output of both analysers against manually annotated test data. We also discuss the benchmark scores and outline prospects for the further development of Middle Russian taggers.
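To make the dictionary-based idea concrete, the following is a minimal Python sketch of spelling normalisation followed by a lookup that returns every candidate analysis without disambiguation, in the spirit of "a wider expansion". It is an illustration only, not the analysers described in the paper: the character mapping `CHAR_MAP`, the toy lexicon `LEXICON`, the tag labels, and the function names `normalise` and `analyse` are all assumptions introduced here.

```python
from typing import Dict, List, Tuple

# An analysis is (lemma, part of speech, grammatical features).
Analysis = Tuple[str, str, str]

# Hypothetical mapping of a few historical letters to modern ones.
# The real module's rules are not given in the abstract.
CHAR_MAP: Dict[str, str] = {
    "ѣ": "е",
    "і": "и",
    "ѡ": "о",
    "ѳ": "ф",
    "ѧ": "я",
    "ѕ": "з",
}

# Toy lexicon keyed by normalised word form (illustrative entries only).
LEXICON: Dict[str, List[Analysis]] = {
    "град": [("град", "NOUN", "nom,sg"), ("град", "NOUN", "acc,sg")],
    "лето": [("лето", "NOUN", "nom,sg"), ("лето", "NOUN", "acc,sg")],
}


def normalise(token: str) -> str:
    """Reduce spelling variation: lowercase, map letters, drop a final hard sign."""
    token = token.lower()
    token = "".join(CHAR_MAP.get(ch, ch) for ch in token)
    if token.endswith("ъ"):
        token = token[:-1]
    return token


def analyse(token: str) -> List[Analysis]:
    """Return all candidate analyses for a token, with no disambiguation."""
    return list(LEXICON.get(normalise(token), []))


if __name__ == "__main__":
    for word in ["Градъ", "лѣто", "неизвестно"]:
        print(word, analyse(word))
```

In this sketch an unknown form simply yields an empty list; a fuller analyser would presumably fall back on paradigm-based guessing or on the annotation database mentioned above.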