The smaller the better? Heterogeneity of corpus, training size, and morphological tagging

Lyashevskaya O., Ostyakova L.

doi:10.28995/2075-7182-2020-19-1091-1108

P. 1091-1108.

Lyashevskaya O., Ostyakova L., Salnikov E. A., Semenova O. A.

The orthographic and morphological heterogeneity of historical texts in pre-modern Slavic causes many difficulties in POS and morphological tagging. Existing approaches to these tasks show state-of-the-art results without normalization, but they are still very sensitive to properties of the training data such as genre and origin. In this paper, we investigate to what extent the heterogeneity and size of the training corpus influence the quality of POS tagging and morphological analysis. We observe that UDPipe, trained on different parts of the Middle Russian corpus, shows a boost in accuracy when it uses less training data. We resolve this paradox by analyzing the distribution of POS tags and short words across subcorpora.
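The kind of comparison named in the last sentence of the abstract can be illustrated with a minimal sketch. The abstract does not say how the distributions were compared, so everything specific here is an assumption made for illustration: the three-character threshold for "short words", the use of Jensen-Shannon divergence as the comparison measure, and the toy transliterated tokens.

```python
from collections import Counter
from math import log2


def pos_distribution(tagged):
    """Relative frequency of each POS tag in a list of (token, tag) pairs."""
    counts = Counter(tag for _, tag in tagged)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}


def short_word_share(tagged, max_len=3):
    """Share of tokens with at most max_len characters (threshold is an assumption)."""
    return sum(len(tok) <= max_len for tok, _ in tagged) / len(tagged)


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two tag distributions."""
    tags = set(p) | set(q)
    # Mixture distribution: the average of p and q over all tags seen in either.
    m = {t: (p.get(t, 0.0) + q.get(t, 0.0)) / 2 for t in tags}

    def kl(a):
        # Kullback-Leibler divergence of a from the mixture m.
        return sum(a[t] * log2(a[t] / m[t]) for t in a if a[t] > 0)

    return (kl(p) + kl(q)) / 2


# Hypothetical toy subcorpora; real input would come from the tagged corpus.
sub_a = [("i", "CCONJ"), ("car", "NOUN"), ("reche", "VERB"), ("k", "ADP")]
sub_b = [("gramota", "NOUN"), ("poslana", "VERB"), ("v", "ADP"), ("gorod", "NOUN")]

dist_a, dist_b = pos_distribution(sub_a), pos_distribution(sub_b)
print(short_word_share(sub_a), short_word_share(sub_b))
print(js_divergence(dist_a, dist_b))
```

A divergence of zero means the two subcorpora have identical tag distributions; with base-2 logarithms the value cannot exceed 1, so larger values indicate more heterogeneous training and test material.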

Keywords: part-of-speech tagging; morphological tagging; full morphological tagging; historical data; corpus size; corpus data homogeneity; automatic processing of historical texts

In book

Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference "Dialogue" (Moscow, June 17-20, 2020). Supplementary volume

M., 2020

Disambiguation in context in the Russian National Corpus: 20 years later

Lyashevskaya O., Afanasev I., Rebrikov S. et al., in: Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference "Dialogue". Issue 22. [s.n.], 2023. P. 307-318.

An updated annotation of the Main, Media, and some other corpora of the Russian National Corpus (RNC) features the part-of-speech and other morphological information, lemmas, dependency structures, and constituency types. Transformer-based architectures are used to resolve the homonymy in context according to a schema based on the manually disambiguated subcorpus of the Main corpus (morphology ...

Added: September 15, 2023

An HMM-based PoS tagger for Old Church Slavonic

Lyashevskaya O., Afanasev I., Jazykovedny Casopis 2021 Vol. 72 No. 2 P. 556-567

We present a hybrid HMM-based PoS tagger for Old Church Slavonic. The training corpus is a portion of one text, Codex Marianus (40k) annotated with the Universal Dependencies UPOS tags in the UD-PROIEL treebank. We perform a number of experiments in within-domain and out-of-domain settings, in which the remaining part of Codex Marianus serves as ...

Added: October 21, 2021

Statistical methods of homonymy resolution

Klyshinskiy E., Rysakov S. V., New Information Technologies in Automated Systems 2015 P. 555-563

The article introduces the reader to statistical methods of resolving morphological ambiguity. It describes the saturation process and the parameters of the methods and of n-grams. Considerable attention is given to homonymy resolution methods: the survey accompanies their descriptions with practical evaluations and gives the algorithms by which they work. The article concludes with the authors' comparison of the quality of the disambiguation methods. ...

Added: November 25, 2015

Crowdsourcing morphological annotation

Bocharov V. V., Alexeeva S. V., Granovsky D. V. et al., in: Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference "Dialogue" (Bekasovo, May 29 - June 2, 2013). In 2 vols. Vol. 1: Main conference program. Issue 12 (19). M.: RGGU, 2013.

Manually annotated corpora are very important and very expensive resources: the annotation process requires a lot of time and skill. In the Open Corpora project, we are trying to involve native speakers with no special linguistic knowledge in the annotation work. In this paper we describe the way we organize our processes in order to maintain high quality of annotation and report ...

Added: November 18, 2013

Morphological Analysis for Russian: Integration and Comparison of Taggers

Kuzmenko E., Communications in Computer and Information Science 2016 No. 661 P. 194-203

In this paper we present a comparison of three morphological taggers for Russian with regard to the quality of morphological disambiguation performed by these taggers. We test the quality of the analysis in three different ways: lemmatization, POS-tagging and assigning full morphological tags. We analyze the mistakes made by the taggers, outline their strengths and ...

Added: June 10, 2016

A test collection for automatic morphological analysis of Middle Russian texts

Lyashevskaya O., in: The Scholarly Legacy of V. A. Bogoroditsky and the Current Direction of Research of the Kazan Linguistic School. Proceedings and materials of the international conference. Vol. 1. Kazan: Kazan University Press, 2018. P. 131-135.

The article describes a test corpus of about 10 thousand tokens, created as a standard for evaluating the quality of systems that analyze Middle Russian texts of the 15th-17th centuries. It sets out the principles of text selection and the annotation procedure. ...

Added: October 9, 2018

Morphosyntactic tagging of Chinese text with statistical analyzers: methodology and quality evaluation.

Kubatieva A., in: I International Youth Conference "Methods of Exact Sciences in Oriental Studies", November 10-11, 2015: Conference proceedings. St. Petersburg: RKhGA Publishing House, 2015.

In this paper, we describe basic principles of POS-classifications and their modelling for POS-tagging of Chinese and statistical NLP systems. Using three available statistical POS-taggers, we conducted an experiment on POS-tagging of Chinese text to analyze quality evaluation, correspondence between POS-tags and categories assigned in different reference grammars. We also determine the basic rules of ...

Added: December 10, 2015

Methods for dealing with homonymy

Rysakov S. V., System Administrator 2015 No. 10(155) P. 92-95

The article provides a review of modern methods of morphological ambiguity resolution. We considered such methods as statistical disambiguation, Brill’s automatically generated rules, decision trees and their modifications. For the comparison, the article provides numerical results obtained on two open corpora: OpenCorpora and SynTagRus. ...

Added: November 25, 2015

MorphoRuEval-2017: an Evaluation Track for the Automatic Morphological Analysis Methods for Russian

Sorokin A., Shavrina T., Lyashevskaya O. et al., in: Computational Linguistics and Intellectual Technologies. International Conference "Dialogue 2017" Proceedings. Vol. 1. Issue 16 (23). M., 2017. P. 297-313.

MorphoRuEval-2017 is an evaluation campaign designed to stimulate the development of the automatic morphological processing technologies for Russian, both for normative texts (news, fiction, nonfiction) and those of less formal nature (blogs and other social media). This article compares the methods participants used to solve the task of morphological analysis. It also discusses the problem ...

Added: October 9, 2018

The Use of Khislavichi Lect Morphological Tagging to Determine its Position in the East Slavic Group

Afanasev I., in: Proceedings of Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023). Association for Computational Linguistics, 2023. P. 174-186.

The study of low-resourced East Slavic lects is becoming increasingly relevant as they face the prospect of extinction under the pressure of standard Russian while being treated by academia as an inferior part of this lect. The Khislavichi lect, spoken in a settlement on the border of Russia and Belarus, is a perfect example of ...

Added: May 15, 2023

Proceedings of Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)

Association for Computational Linguistics, 2023

These proceedings include the 23 papers presented at the 10th Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), co-located with the 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL). Both EACL and VarDial were held in Dubrovnik, Croatia, in a hybrid format, allowing participants to attend on-site or ...

Added: May 15, 2023

Redefining part-of-speech classes with distributional semantic models

Kutuzov A. B., Velldal E., Øvrelid L., in: Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning. Berlin: Association for Computational Linguistics, 2016. P. 115-125.

This paper studies how word embeddings trained on the British National Corpus interact with part of speech boundaries. Our work targets the Universal PoS tag set, which is currently actively being used for annotation of a range of languages. We experiment with training classifiers for predicting PoS tags for words based on their embeddings. The ...

Added: November 12, 2016

Parametric optimization of the accuracy of morphological text tagging

Klyshinskiy E., Rysakov S. V., New Information Technologies in Automated Systems 2016

The article introduces the reader to the basic concepts of parametric optimization. It describes the probability approximation model that was developed, counter functions, and correlation coefficients. Some attention is given to the exhaustive search method, which yielded new accuracy figures. The article concludes with a modification of the homonymy resolution method developed by the authors. ...

Added: June 14, 2016

A Reusable Tagset for the Morphologically Rich Language in Change: a Case of Middle Russian

Lyashevskaya O., in: Computational Linguistics and Intellectual Technologies. Issue 18. M.: Russian State University for the Humanities, 2019. P. 422-434.

The paper discusses the standardization efforts to create a morphological standard for the Middle Russian corpus, which is part of the historical collection of the Russian National Corpus (RNC). To meet the needs of different categories of corpus researchers as well as NLP developers, we consider two styles of the morphological annotation (RNC schema and ...

Added: June 12, 2019

Automatic part-of-speech tagging for Russian using transformation-based learning.

Kitov V. V., Scientific Works of the Free Economic Society of Russia 2014 Vol. 186 P. 228-235

This paper describes the application of the well-known "transformation-based learning" algorithm for automatic rule generation to the task of part-of-speech tagging. The algorithm is applied to corpora of annotated Russian texts, and the accuracy as well as the most significant rules are shown. ...

Added: March 16, 2016