Взiaлъ, възялъ, вьзял: Handling orthographic variation in the lexico-grammatical annotation of the Middle Russian corpus of the 15th-17th centuries
The highly unstable orthography of Middle Russian texts poses a challenge for their automatic processing. The Middle Russian subcorpus of the Russian National Corpus (RNC) includes documents written mainly between 1400 and 1700, when variation in spelling was still the norm. The task of lexico-grammatical analysis is to assign a dictionary form (lemma), a part of speech, and grammatical tags to each word form in the corpus. Traditional methods of POS and grammatical tagging assume that the stem and ending of each grammatical form of a word are represented by one (or almost always one) string of characters. Because unstable orthography yields a many-to-many mapping between word forms and grammatical annotations, morphological taggers perform poorly on such texts and require orthographic normalization as a preprocessing step.
We use both relative and absolute normalization of orthographic representations. Relative normalization multiplies the orthographic representations of stems and endings in the grammatical dictionary by means of regular rules, that is, it generates their spelling variants. It is carried out at the level of (a) word endings; (b) nominative stems with regular variation, e.g. russk(ij) / russt(ij), keli(ja) / kel'(ja); (c) nominative stems of Church Slavonic origin, e.g. odin- / edin-; (d) verb stems with prefixes; etc. Absolute normalization matches characters or character combinations that alternate regularly in the corpus (e.g. o / ѡ 'omega', e / ѣ, шт / щ, жю / жу). It applies both to the orthographic representations in the grammatical dictionary and to the word forms in the text.
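The two kinds of normalization can be illustrated with a minimal Python sketch. It is not the authors' implementation. The rule tables contain only the examples quoted above, the function names `normalize_absolute` and `expand_stem` are hypothetical, and treating the second member of each character pair as the canonical one is an assumption made for illustration.

```python
# Absolute normalization: pairs of regularly alternating spellings.
# The pairs come from the examples in the text; treating the second member
# as the canonical one is an assumption of this sketch.
ABSOLUTE_RULES = [
    ("ѡ", "о"),    # omega / o
    ("ѣ", "е"),    # yat / e
    ("шт", "щ"),
    ("жю", "жу"),
]


def normalize_absolute(form: str) -> str:
    """Collapse alternating spellings to one representative string.

    The same function is meant to be applied to dictionary
    representations and to word forms from the text.
    """
    form = form.lower()
    for variant, canonical in ABSOLUTE_RULES:
        form = form.replace(variant, canonical)
    return form


# Relative normalization: stem alternations used to multiply dictionary
# entries. Stems are given in the transliteration used in the text; a real
# rule set would be far larger and would also cover endings and prefixes.
STEM_ALTERNATIONS = [
    ("russk", "russt"),
    ("keli", "kel'"),
    ("odin", "edin"),
]


def expand_stem(stem: str) -> set:
    """Return the stem together with its known orthographic variants."""
    variants = {stem}
    for pair in STEM_ALTERNATIONS:
        if stem in pair:
            variants.update(pair)
    return variants
```

Under these assumptions, a spelling with ѣ and a spelling with е reduce to the same string, so `normalize_absolute("лѣто") == normalize_absolute("лето")` holds, while `expand_stem("russk")` yields both `russk` and `russt` as dictionary stems to be combined with the endings.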