The smaller the better? Heterogeneity of corpus, training size, and morphological tagging

Lyashevskaya O., Ostyakova L.

doi:10.28995/2075-7182-2020-19-1091-1108

P. 1091–1108.

Lyashevskaya O., Ostyakova L., Сальников Е. А., Семенова О. А.

Orthographic and morphological heterogeneity of historical texts in pre-modern Slavic complicates POS and morphological tagging. Existing approaches to these tasks show state-of-the-art results without normalization, but they remain very sensitive to properties of the training data such as genre and origin. In this paper, we investigate to what extent the heterogeneity and size of the training corpus influence the quality of POS tagging and morphological analysis. We observe that UDPipe, trained on different parts of the Middle Russian corpus, achieves higher accuracy when it uses less training data. We resolve this paradox by analyzing the distribution of POS tags and short words across subcorpora.
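A comparison of POS-tag and short-word distributions across subcorpora, as mentioned in the abstract, could look roughly like the sketch below. It is only an illustration: the abstract does not say which measure the authors used, so the Jensen-Shannon divergence, the two-character threshold for "short" words, the function names, and the toy data are all assumptions.

```python
from collections import Counter
from math import log2


def tag_distribution(tagged_tokens):
    """Relative frequency of each POS tag in a list of (form, tag) pairs."""
    counts = Counter(tag for _, tag in tagged_tokens)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two tag distributions.

    Returns 0.0 for identical distributions and 1.0 for disjoint ones.
    Chosen here for illustration only; the paper may use another measure.
    """
    tags = set(p) | set(q)
    # Mixture distribution m = (p + q) / 2 over the union of tags.
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in tags}

    def kl(a):
        # Kullback-Leibler divergence of a from the mixture m.
        return sum(a[t] * log2(a[t] / m[t]) for t in a if a[t] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


def short_word_share(tagged_tokens, max_len=2):
    """Share of tokens whose form has at most max_len characters (assumed threshold)."""
    short = sum(1 for form, _ in tagged_tokens if len(form) <= max_len)
    return short / len(tagged_tokens)


# Invented toy subcorpora (not real corpus data): (form, UPOS tag) pairs.
subcorpus_a = [("i", "CCONJ"), ("v", "ADP"), ("grad", "NOUN"), ("ide", "VERB")]
subcorpus_b = [("a", "CCONJ"), ("gramota", "NOUN"), ("dvor", "NOUN"), ("dal", "VERB")]

divergence = js_divergence(tag_distribution(subcorpus_a), tag_distribution(subcorpus_b))
print(f"JS divergence between subcorpora: {divergence:.3f}")
print(f"Short-word share: {short_word_share(subcorpus_a):.2f} vs {short_word_share(subcorpus_b):.2f}")
```

A high divergence or a large gap in short-word share between a training subcorpus and the test data would be one way to quantify the heterogeneity the abstract refers to.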

Keywords: part-of-speech tagging; morphological tagging; full morphological tagging; historical data; corpus size; corpus data homogeneity; automatic processing of historical texts

In book

Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference "Dialogue" (Moscow, June 17–20, 2020). Supplementary volume

M., 2020.

Transformer-based approaches for lemmatizing abbreviations in Russian texts

Glazkova A., Lyashevskaya O., Morozov D. et al., Journal of Mathematical Sciences 2025 Vol. 546 P. 32–47

This paper addresses the task of lemmatizing abbreviations in the Russian language. Abbreviation lemmatization is particularly challenging, as it involves not only transforming a word into its normal form but also correctly expanding the abbreviation. We explore two approaches to this task, both leveraging large pretrained language models. The first approach is generative, where the ...

Added: March 10, 2026

The grammatical landscape of fiction: dynamics of part-of-speech distributions in the 20th-century Russian short story

Kirina M., in: Russian Grammar: Polyparadigmality as a Methodological Principle of Modern Research: Proceedings of the IX International Scientific Symposium. Издательство ИГУ, 2025. P. 270–275.

The article presents the results of a pilot study describing the distribution of parts of speech, synchronically and diachronically, in Russian short prose. It examines changes in the morphological composition of literary texts (at the level of grammatical classes) over the 20th century across 9 historical and cultural periods. The material is a sample of 943 short stories totaling more than 3 million tokens. ...

Added: February 28, 2026

Language models for text preprocessing in machine translation

Mylnikova A., Mylnikov L., Научно-техническая информация. Серия 2: Информационные процессы и системы 2025 No. 7 P. 32–44

The paper considers a model that uses skeleton structures based on syntactic annotation to preprocess text corpora before they are passed to neural machine translation models, with the aim of improving translation quality. The model is implemented with part-of-speech and syntactic annotation of the corpora, produced by a language model, using a BERT network and a set of rules. Data preparation for training is described, and ways of improving the efficiency ...

Added: September 22, 2025

Disambiguation in context in the Russian National Corpus: 20 years later

Lyashevskaya O., Afanasev I., Rebrikov S. et al., in: Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference "Dialogue". Issue 22. [s.n.], 2023. P. 307–318.

An updated annotation of the Main, Media, and some other corpora of the Russian National Corpus (RNC) features the part-of-speech and other morphological information, lemmas, dependency structures, and constituency types. Transformer-based architectures are used to resolve the homonymy in context according to a schema based on the manually disambiguated subcorpus of the Main corpus (morphology ...

Added: September 15, 2023

The Use of Khislavichi Lect Morphological Tagging to Determine its Position in the East Slavic Group

Afanasev I., in: Proceedings of Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023). Association for Computational Linguistics, 2023. P. 174–186.

The study of low-resourced East Slavic lects is becoming increasingly relevant as they face the prospect of extinction under the pressure of standard Russian while being treated by academia as an inferior part of this lect. The Khislavichi lect, spoken in a settlement on the border of Russia and Belarus, is a perfect example of ...

Added: May 15, 2023

Proceedings of Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)

Association for Computational Linguistics, 2023.

These proceedings include the 23 papers presented at the 10th Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), co-located with the 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL). Both EACL and VarDial were held in Dubrovnik, Croatia, in a hybrid format, allowing participants to attend on-site or ...

Added: May 15, 2023

An HMM-based PoS tagger for Old Church Slavonic

Lyashevskaya O., Afanasev I., Jazykovedný časopis 2021 Vol. 72 No. 2 P. 556–567

We present a hybrid HMM-based PoS tagger for Old Church Slavonic. The training corpus is a portion of one text, Codex Marianus (40k) annotated with the Universal Dependencies UPOS tags in the UD-PROIEL treebank. We perform a number of experiments in within-domain and out-of-domain settings, in which the remaining part of Codex Marianus serves as ...

Added: October 21, 2021

A Reusable Tagset for the Morphologically Rich Language in Change: a Case of Middle Russian

Lyashevskaya O., in: Computational Linguistics and Intellectual Technologies. Issue 18. M.: Russian State University for the Humanities, 2019. P. 422–434.

The paper discusses the standardization efforts to create a morphological standard for the Middle Russian corpus, which is part of the historical collection of the Russian National Corpus (RNC). To meet the needs of different categories of corpus researchers as well as NLP developers, we consider two styles of the morphological annotation (RNC schema and ...

Added: June 12, 2019

MorphoRuEval-2017: an Evaluation Track for the Automatic Morphological Analysis Methods for Russian

Sorokin A., Shavrina T., Lyashevskaya O. et al., in: Computational Linguistics and Intellectual Technologies. International Conference "Dialogue 2017" Proceedings. Vol. 1. Issue 16 (23). M., 2017. P. 297–313.

MorphoRuEval-2017 is an evaluation campaign designed to stimulate the development of the automatic morphological processing technologies for Russian, both for normative texts (news, fiction, nonfiction) and those of less formal nature (blogs and other social media). This article compares the methods participants used to solve the task of morphological analysis. It also discusses the problem ...

Added: October 9, 2018

A test collection for automatic morphological analysis of Middle Russian texts

Lyashevskaya O., in: The Scholarly Legacy of V. A. Bogoroditsky and the Current Research Vector of the Kazan Linguistic School. Proceedings and materials of the international conference. Vol. 1. Kazan: Издательство Казанского университета, 2018. P. 131–135.

The article describes a test corpus of about 10 thousand tokens, created as a standard for evaluating systems that analyze Middle Russian texts of the 15th–17th centuries. It outlines the principles of text selection and the annotation procedure. ...

Added: October 9, 2018

Redefining part-of-speech classes with distributional semantic models

Kutuzov A. B., Velldal E., Øvrelid L., in: Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning. Berlin: Association for Computational Linguistics, 2016. P. 115–125.

This paper studies how word embeddings trained on the British National Corpus interact with part of speech boundaries. Our work targets the Universal PoS tag set, which is currently actively being used for annotation of a range of languages. We experiment with training classifiers for predicting PoS tags for words based on their embeddings. The ...

Added: November 12, 2016

Parametric optimization of the accuracy of morphological tagging of texts

Klyshinskiy E., Рысаков С. В., Новые информационные технологии в автоматизированных системах 2016

The article introduces the reader to the basic concepts of parametric optimization. It describes the probability approximation model developed by the authors, counter functions, and correlation coefficients. Some attention is given to the exhaustive search method, which yielded new accuracy figures. The article closes with a modification of the disambiguation method developed by the authors. ...

Added: June 14, 2016

Morphosyntactic tagging of Chinese text with statistical analyzers: methodology and quality evaluation

Kubatieva A., in: 1st International Youth Conference "Methods of Exact Sciences in Oriental Studies", November 10–11, 2015: Conference Proceedings. St. Petersburg: Издательство РХГА, 2015.

In this paper, we describe basic principles of POS-classifications and their modelling for POS-tagging of Chinese and statistical NLP systems. Using three available statistical POS-taggers, we conducted an experiment on POS-tagging of Chinese text to analyze quality evaluation, correspondence between POS-tags and categories assigned in different reference grammars. We also determine the basic rules of ...

Added: December 10, 2015

Statistical disambiguation methods

Klyshinskiy E., Рысаков С. В., Новые информационные технологии в автоматизированных системах 2015 P. 555–563

The article introduces the reader to statistical methods for resolving morphological ambiguity. It describes the saturation process and the parameters of the methods and of n-grams. Considerable attention is given to disambiguation methods: the survey accompanies their descriptions with practical evaluations and gives their algorithms. The article closes with the authors' comparison of the quality of the disambiguation methods. ...

Added: November 25, 2015

Methods for dealing with homonymy

Рысаков С. В., Системный администратор 2015 No. 10(155) P. 92–95

The article provides a review of modern methods of morphological ambiguity resolution. We considered such methods as statistical disambiguation, Brill’s automatically generated rules, decision trees and their modifications. For the comparison, the article provides numerical results obtained on two open corpora: OpenCorpora and SynTagRus. ...

Added: November 25, 2015

Crowdsourcing morphological annotation

Bocharov V. V., Alexeeva S. V., Granovsky D. V. et al., in: Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference "Dialogue" (Bekasovo, May 29 – June 2, 2013). In 2 vols. Vol. 1: Main conference program. Issue 12 (19). M.: РГГУ, 2013.

Manually annotated corpora are very important and very expensive resources: the annotation process requires a lot of time and skill. In the OpenCorpora project, we try to involve native speakers with no special linguistic training in the annotation work. In this paper we describe how we organize our processes in order to maintain high annotation quality and report ...

Added: November 18, 2013