The smaller the better? Heterogeneity of corpus, training size, and morphological tagging

P. 1091–1108.
Lyashevskaya O., Ostyakova L., Сальников Е. А., Семенова О. А.

Orthographic and morphological heterogeneity of historical texts in pre-modern Slavic causes many difficulties in POS and morphological tagging. Existing approaches to these tasks show state-of-the-art results without normalization, but they are still very sensitive to properties of the training data such as genre and origin. In this paper, we investigate to what extent the heterogeneity and size of the training corpus influence the quality of POS tagging and morphological analysis. We observe that UDPipe trained on different parts of the Middle Russian corpus demonstrates a boost in accuracy when using less training data. We resolve this paradox by analyzing the distribution of POS tags and short words across subcorpora.

Language: English
Keywords: part of speech tagging, morphological tagging, full morphological tagging, historical data, corpus size, corpus data homogeneity, automatic processing of historical texts

In book

Computational Linguistics and Intellectual Technologies: Proceedings of the Annual International Conference "Dialogue" (Moscow, June 17–20, 2020). Supplementary volume
M.: ., 2020.
Similar publications
Transformer-based approaches for lemmatizing abbreviations in Russian texts
Glazkova A., Lyashevskaya O., Morozov D. et al., Journal of Mathematical Sciences 2025 Vol. 546 P. 32–47
This paper addresses the task of lemmatizing abbreviations in the Russian language. Abbreviation lemmatization is particularly challenging, as it involves not only transforming a word into its normal form but also correctly expanding the abbreviation. We explore two approaches to this task, both leveraging large pretrained language models. The first approach is generative, where the ...
Added: March 10, 2026
The grammatical landscape of literary prose: dynamics of part-of-speech distributions in the 20th-century Russian short story
Kirina M., in: Russian Grammar: Polyparadigmality as a Methodological Principle of Modern Research: Proceedings of the IX International Scientific Symposium.: Издательство ИГУ, 2025. P. 270–275.
The paper presents the results of a pilot study describing the distribution of parts of speech, synchronically and diachronically, in Russian short prose. It examines changes in the morphological composition of literary texts (at the level of grammatical classes) over the 20th century across 9 historical and cultural periods. The material is a sample of 943 short stories totalling more than 3 million tokens. ...
Added: February 28, 2026
Language models for text preprocessing in machine translation
Mylnikova A., Mylnikov L., Научно-техническая информация. Серия 2: Информационные процессы и системы 2025 No. 7 P. 32–44
The paper considers a model that uses skeleton structures based on syntactic annotation to preprocess text corpora before they are passed to neural machine translation models, with the aim of improving translation quality. The model is implemented through part-of-speech and syntactic annotation of the corpora using a language model, the BERT network, and a set of rules. Data preparation for training is described, and ways of improving the efficiency ...
Added: September 22, 2025
Disambiguation in context in the Russian National Corpus: 20 years later
Lyashevskaya O., Afanasev I., Stefan Rebrikov et al., in: Computational Linguistics and Intellectual Technologies: Proceedings of the Annual International Conference "Dialogue". Issue 22.: [s.n.], 2023. P. 307–318.
An updated annotation of the Main, Media, and some other corpora of the Russian National Corpus (RNC) features the part-of-speech and other morphological information, lemmas, dependency structures, and constituency types. Transformer-based architectures are used to resolve the homonymy in context according to a schema based on the manually disambiguated subcorpus of the Main corpus (morphology ...
Added: September 15, 2023
The Use of Khislavichi Lect Morphological Tagging to Determine its Position in the East Slavic Group
Afanasev I., in: Proceedings of Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023).: Association for Computational Linguistics, 2023. P. 174–186.
The study of low-resourced East Slavic lects is becoming increasingly relevant as they face the prospect of extinction under the pressure of standard Russian while being treated by academia as an inferior part of this lect. The Khislavichi lect, spoken in a settlement on the border of Russia and Belarus, is a perfect example of ...
Added: May 15, 2023
Proceedings of Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)
Association for Computational Linguistics, 2023.
These proceedings include the 23 papers presented at the 10th Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), co-located with the 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL). Both EACL and VarDial were held in Dubrovnik, Croatia, in a hybrid format, allowing participants to attend on-site or ...
Added: May 15, 2023
An HMM-based PoS tagger for Old Church Slavonic
Lyashevskaya O., Afanasev I., Jazykovedny Casopis 2021 Vol. 72 No. 2 P. 556–567
We present a hybrid HMM-based PoS tagger for Old Church Slavonic. The training corpus is a portion of one text, Codex Marianus (40k) annotated with the Universal Dependencies UPOS tags in the UD-PROIEL treebank. We perform a number of experiments in within-domain and out-of-domain settings, in which the remaining part of Codex Marianus serves as ...
Added: October 21, 2021
A Reusable Tagset for the Morphologically Rich Language in Change: a Case of Middle Russian
Lyashevskaya O., in: Computational Linguistics and Intellectual Technologies. Issue 18.: M.: Russian State University for the Humanities, 2019. P. 422–434.
The paper discusses the standardization efforts to create a morphological standard for the Middle Russian corpus, which is part of the historical collection of the Russian National Corpus (RNC). To meet the needs of different categories of corpus researchers as well as NLP developers, we consider two styles of the morphological annotation (RNC schema and ...
Added: June 12, 2019
MorphoRuEval-2017: an Evaluation Track for the Automatic Morphological Analysis Methods for Russian
Sorokin A., Shavrina T., Lyashevskaya O. et al., in: Computational Linguistics and Intellectual Technologies. International Conference "Dialogue 2017" Proceedings. Vol. 1. Issue 16 (23).: M.: -, 2017. P. 297–313.
MorphoRuEval-2017 is an evaluation campaign designed to stimulate the development of the automatic morphological processing technologies for Russian, both for normative texts (news, fiction, nonfiction) and those of less formal nature (blogs and other social media). This article compares the methods participants used to solve the task of morphological analysis. It also discusses the problem ...
Added: October 9, 2018
A test collection for automatic morphological analysis of Middle Russian texts
Lyashevskaya O., in: The Scholarly Legacy of V. A. Bogoroditsky and the Current Direction of Research of the Kazan Linguistic School. Proceedings and materials of the international conference. Vol. 1.: Каз.: Издательство Казанского университета, 2018. P. 131–135.
The paper describes a test corpus of about 10 thousand tokens, created as a standard for evaluating the quality of systems that analyze Middle Russian texts of the 15th–17th centuries. The principles of text selection and the annotation procedure are outlined. ...
Added: October 9, 2018
Redefining part-of-speech classes with distributional semantic models
Kutuzov A. B., Velldal E., Øvrelid L., in: Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning.: Berlin: Association for Computational Linguistics, 2016. P. 115–125.
This paper studies how word embeddings trained on the British National Corpus interact with part of speech boundaries. Our work targets the Universal PoS tag set, which is currently actively being used for annotation of a range of languages. We experiment with training classifiers for predicting PoS tags for words based on their embeddings. The ...
Added: November 12, 2016
Parametric optimization of the accuracy of morphological tagging of texts
Klyshinskiy E., Рысаков С. В., Новые информационные технологии в автоматизированных системах 2016
The article introduces the basic concepts of parametric optimization. It describes the probability approximation model developed by the authors, counter functions, and correlation coefficients. Brief attention is given to the exhaustive search method, which yielded new accuracy figures. The article concludes with a modification of the disambiguation method developed by the authors. ...
Added: June 14, 2016
Morphosyntactic tagging of Chinese text with statistical analyzers: methodology and quality evaluation
Kubatieva A., in: I Youth International Conference "Methods of Exact Sciences in Oriental Studies", November 10–11, 2015: Conference proceedings.: СПб.: Издательство РХГА, 2015.
In this paper, we describe basic principles of POS-classifications and their modelling for POS-tagging of Chinese and statistical NLP systems. Using three available statistical POS-taggers, we conducted an experiment on POS-tagging of Chinese text to analyze quality evaluation, correspondence between POS-tags and categories assigned in different reference grammars. We also determine the basic rules of ...
Added: December 10, 2015
Statistical methods of homonymy resolution
Klyshinskiy E., Рысаков С. В., Новые информационные технологии в автоматизированных системах 2015 P. 555–563
The article introduces statistical methods for resolving morphological ambiguity. It describes the saturation process and the parameters of the methods and n-grams. Much attention is given to disambiguation methods: the survey accompanies their descriptions with practical evaluations and gives their algorithms. The article concludes with the authors' comparison of the quality of the disambiguation methods. ...
Added: November 25, 2015
Methods for dealing with homonymy
Рысаков С. В., Системный администратор 2015 No. 10(155) P. 92–95
The article provides a review of modern methods of morphological ambiguity resolution. We considered such methods as statistical disambiguation, Brill’s automatically generated rules, decision trees and their modifications. For the comparison, the article provides numerical results obtained on two open corpora: OpenCorpora and SynTagRus. ...
Added: November 25, 2015
Crowdsourcing morphological annotation
Bocharov V. V., Alexeeva S. V., Granovsky D. V. et al., in: Computational Linguistics and Intellectual Technologies: Proceedings of the Annual International Conference "Dialogue" (Bekasovo, May 29 - June 2, 2013). In 2 vols. Vol. 1: Main conference programme. Issue 12 (19).: М.: РГГУ, 2013.
Manually annotated corpora are very important and very expensive resources: the annotation process requires a lot of time and skills. In the OpenCorpora project we try to involve native speakers with no special linguistic knowledge in the annotation work. In this paper we describe the way we organize our processes in order to maintain high quality of annotation and report ...
Added: November 18, 2013