Evaluation for morphologically rich language: Russian NLP

Toldova S., Lyashevskaya O., Bonch-Osmolovskaya A. A., Ionov M.

P. 300–306.

Abstract: RU-EVAL is a biennial event organized to assess the state of the art in Russian NLP resources, methods and toolkits and to compare the methods and principles implemented for Russian. Russian can be treated as an under-resourced language because it lacks freely distributable gold-standard corpora for different NLP tasks (each team has tried to work out its own standards). Our goal was therefore to establish a uniform basis for comparing systems built on different theoretical and engineering approaches, to build evaluation resources, and to provide a flexible evaluation scheme that distinguishes unacceptable errors from linguistically “admissible” ones. The paper reports on three events devoted to morphological tagging, dependency parsing and anaphora resolution, respectively.

Language: English

Keywords: NLP evaluation, coreference resolution, morphological tagging, dependency parsing

In book

Proceedings on the International Conference on Artificial Intelligence (ICAI)

Vol. 1. Las Vegas: CSREA Press, 2015.

Proceedings of the Third Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2025)

Tartu: University of Tartu Library, 2025.

The third workshop on resources and representations for under-resourced languages and domains was held in Tallinn, Estonia, on March 2nd, 2025. The workshop was conducted in person but also provided an option for online participation. In alignment with the goals of the previous two workshops in 2020 and 2023, RESOURCEFUL-2025 explored the role of resource ...

Added: July 17, 2025

From web to dialects: how to enhance non-standard Russian lects lemmatisation?

Afanasev I., Lyashevskaya O., in: Proceedings of the 2023 CLASP Conference on Learning with Small Data (LSD). Gothenburg: Association for Computational Linguistics, 2023. P. 167–175.

The growing need for using small data distinguished by a set of distributional properties becomes all the more apparent in the era of large language models (LLMs). In this paper, we show that for the lemmatisation of web-as-corpus texts, heterogeneous social media texts, and dialect texts, the morphological tagging by a model ...

Added: December 10, 2023

Disambiguation in context in the Russian National Corpus: 20 years later

Lyashevskaya O., Afanasev I., Rebrikov S. et al., in: Computational Linguistics and Intellectual Technologies: Proceedings of the Annual International Conference "Dialogue". Issue 22. [s.n.], 2023. P. 307–318.

An updated annotation of the Main, Media, and some other corpora of the Russian National Corpus (RNC) features the part-of-speech and other morphological information, lemmas, dependency structures, and constituency types. Transformer-based architectures are used to resolve the homonymy in context according to a schema based on the manually disambiguated subcorpus of the Main corpus (morphology ...

Added: September 15, 2023

The Use of Khislavichi Lect Morphological Tagging to Determine its Position in the East Slavic Group

Afanasev I., in: Proceedings of Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023). Association for Computational Linguistics, 2023. P. 174–186.

The study of low-resourced East Slavic lects is becoming increasingly relevant as they face the prospect of extinction under the pressure of standard Russian while being treated by academia as an inferior part of this lect. The Khislavichi lect, spoken in a settlement on the border of Russia and Belarus, is a perfect example of ...

Added: May 15, 2023

Proceedings of Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)

Association for Computational Linguistics, 2023.

These proceedings include the 23 papers presented at the 10th Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), co-located with the 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL). Both EACL and VarDial were held in Dubrovnik, Croatia, in a hybrid format, allowing participants to attend on-site or ...

Added: May 15, 2023

Proceedings of the First Workshop on Computational Approaches to Discourse

Association for Computational Linguistics, 2020.

Added: November 18, 2020

Humans Keep It One Hundred: an Overview of AI Journey

Shavrina T., Emelyanov A., Fenogenova A. et al., in: Proceedings of The 12th Language Resources and Evaluation Conference. Vol. 12. European Language Resources Association (ELRA), 2020. P. 2276–2284.

Artificial General Intelligence (AGI) is showing growing performance in numerous applications: beating human performance in chess and Go, using knowledge bases and text sources to answer questions (SQuAD) and even passing human examinations (Aristo project). In this paper, we describe the results of AI Journey, a competition of AI systems aimed at improving AI performance ...

Added: June 15, 2020

Proceedings of The 12th Language Resources and Evaluation Conference

European Language Resources Association (ELRA), 2020.

Welcome to the 12th edition of LREC . . . that should have been in Marseille, first time in France! Unfortunately not now, in May 2020. Now my welcome is completely virtual, to all of you authors of these Proceedings papers and to the colleagues who will look at these. Virtual but not less sincere. ...

Added: June 15, 2020

A cross-genre morphological tagging and lemmatization of the Russian poetry: distinctive test sets and evaluation

Starchenko A., Lyashevskaya O., in: Digital Transformation and Global Society. Fourth International Conference, DTGS 2019, St. Petersburg, Russia, June 19–21, 2019, Revised Selected Papers. Springer, 2019. P. 732–743.

The poetic texts pose a challenge to full morphological tagging and lemmatization since the authors seek to extend the vocabulary, employ morphologically and semantically deficient forms, go beyond standard syntactic templates, use non-projective constructions and non-standard word order, among other techniques of the creative language game. In this paper we evaluate a number of probabilistic ...

Added: June 12, 2019

Applying statistical tagging to Russian poetry

Starchenko A., Kazakevich L., Lyashevskaya O. / NRU HSE. Series WP BRP "Linguistics". 2018. No. 76.

Added: December 12, 2018

Data Conversion and Consistency of Monolingual Corpora: Russian UD Treebanks

Droganova K. A., Lyashevskaya O., Zeman D., in: Proceedings of TLT 2018 International Workshop on Treebanks and Linguistic Theories, 13-14 November 2018, Oslo, Norway. NEALT Proceedings Series. Linköping University Electronic Press, 2018. P. 52–65.

In this paper we focus on syntactic annotation consistency within Universal Dependencies (UD) treebanks for Russian: UD_Russian-SynTagRus, UD_Russian-GSD, UD_Russian-Taiga, and UD_Russian-PUD. We describe the four treebanks, their distinctive features and development. In order to test and improve consistency within the treebanks, we reconsidered the experiments by Martinez Alonso and Zeman; our parsing experiments were conducted ...

Added: November 6, 2018

Automatic morphological analysis on the material of Russian social media texts

Fenogenova A., Kazorin V., Karpov I. et al., in: Proceedings of Third Workshop "Computational linguistics and language science". Issue 4. Manchester: EasyChair, 2019. P. 11–17.

Automatic morphological analysis is one of the fundamental tasks of NLP (Natural Language Processing). Because Internet texts can be both normative (news, fiction, nonfiction) and less formal (such as blogs and texts from social networks), morphological tagging has become a non-trivial and relevant task. ...

Added: October 5, 2018

Employing Wikipedia data for coreference resolution in Russian

Azerkovich I., in: Artificial Intelligence and Natural Language, 7th International Conference, AINL 2018, St. Petersburg, Russia, October 17–19, 2018, Proceedings. Issue 930. Switzerland: Springer, 2018. P. 107–112.

Many researchers consider semantic information a valuable resource for coreference resolution. Unfortunately, little has been done to exploit such information for Russian. This work describes the first step of a research project that attempts to create a coreference resolution system for Russian based on semantic data, concerned with ...

Added: September 5, 2018

Features for Discourse-New Referent Detection in Russian

Toldova S., Ionov M., in: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). 17th International Conference on Intelligent Text Processing and Computational Linguistics, CICLing 2016. Vol. 1. Issue 9623. Springer Publishing Company, 2018. P. 648–662.

This paper concerns discourse-new mention detection in Russian. Detecting the mention of an entity newly introduced into discourse might be helpful for different NLP applications such as coreference resolution, protagonist identification, summarization and different tasks of information extraction. In our work, we are dealing with Russian, where there are no grammatical devices, ...

Added: September 1, 2018

Text collections for evaluation of Russian morphological taggers

Lyashevskaya O., Bocharov V., Sorokin A. et al., Jazykovedny Casopis 2017 Vol. 68 No. 2 P. 258–267

The paper describes the preparation and development of the text collections within the framework of MorphoRuEval-2017 shared task, an evaluation campaign designed to stimulate development of the automatic morphological processing technologies for Russian. The main challenge for the organizers was to standardize all available Russian corpora with the manually verified high-quality tagging to a single ...

Added: January 30, 2018

Identification of Singleton Mentions in Russian

Toldova S., Ionov M., in: CLLS 2016. Computational Linguistics and Language Science. Proceedings of the Workshop on Computational Linguistics and Language Science. Moscow, Russia, April 26, 2016. Vol. 1886. Aachen: CEUR Workshop Proceedings, 2017. Ch. 5. P. 33–41.

This paper describes a pilot study of the problem of detecting singleton mentions in Russian texts. A noun phrase is considered a singleton mention if it is the only referent of some entity. We discuss various morphosyntactic and lexical features, some of which were used for analogous tasks for English and propose new features derived ...

Added: November 9, 2017

Coreference resolution for Russian: the impact of semantic features

Toldova S., Ionov M., in: Computational Linguistics and Intellectual Technologies. International Conference "Dialogue 2017" Proceedings. Vol. 1. Issue 16 (23). Moscow, 2017. P. 339–348.

This paper presents the results of our experiments on building a general coreference resolution system for Russian. The main aim of those experiments was to set a baseline for this task for Russian using the standard set of features developed and tested for coreference resolution systems created for other languages. We propose several baseline systems, ...

Added: July 12, 2017

Mention Detection for Improving Coreference Resolution in Russian Texts: A Machine Learning Approach

Toldova S., Ionov M., Computacion y Sistemas 2016 Vol. 20 No. 4 P. 681–696

The paper concerns discourse-new referent detection. The task of coreference resolution is essential in many text-mining applications. The focus in this task is to detect noun phrases (NPs) that refer to the same entity. In languages without articles, there are no overt grammatical clues in an NP for whether it introduces a new referent into ...

Added: December 27, 2016

Error analysis for anaphora resolution in Russian: new challenging issues for anaphora resolution task in a morphologically rich language

Roytberg A., Toldova S., Ladygina A. et al., in: Proceedings of the Workshop on Coreference Resolution Beyond OntoNotes (CORBON 2016), co-located with NAACL 2016, San Diego, California, June 16, 2016. Stroudsburg, PA: Association for Computational Linguistics, 2016. P. 74–83.

This paper presents a quantitative and qualitative error analysis of Russian anaphora resolvers which participated in the RU-EVAL event. Its aim is to identify and characterize a set of challenging errors common to state-of-the-art systems dealing with Russian. We examined three types of pronouns: 3rd person pronouns, reflexive and relative pronouns. The investigation has shown ...

Added: December 7, 2016