Evaluation for morphologically rich language: Russian NLP
Abstract - RU-EVAL is a biennial event organized to assess the state of the art in Russian NLP resources, methods and toolkits and to compare the various methods and principles implemented for Russian. Russian can be treated as an under-resourced language because freely distributable gold standard corpora are lacking for many NLP tasks (each team has tried to work out its own standards). Our goal was therefore to establish a uniform basis for comparing systems built on different theoretical and engineering approaches, to build evaluation resources, and to provide a flexible evaluation scheme that distinguishes unacceptable errors from linguistically “admissible” ones. The paper reports on three events devoted to morphological tagging, dependency parsing and anaphora resolution, respectively.
The paper reports on the recent forum RU-EVAL, a new initiative for the evaluation of Russian NLP resources, methods and toolkits. The first two events were devoted to morphological and syntactic parsing, respectively. The third event is devoted to anaphora and coreference resolution. Seven participating IT companies and academic institutions submitted results for the anaphora resolution task, and three of them also presented results for the coreference resolution task. The event was organized to assess the state of the art for this NLP task in Russian and to compare the various methods and principles implemented for Russian. We discuss the evaluation procedure, specify the anaphora and coreference tasks, and describe the phenomena taken into consideration. We also give a brief overview of similar evaluation events whose experience we draw on. Finally, we formulate guidelines for constructing the training and Gold Standard corpora and present the measures used in evaluation.
The paper describes the preparation and development of the text collections for the MorphoRuEval-2017 shared task, an evaluation campaign designed to stimulate the development of automatic morphological processing technologies for Russian. The main challenge for the organizers was to convert all available Russian corpora with manually verified high-quality tagging to a single format (Universal Dependencies CoNLL-U). The data sources were the disambiguated subcorpus of the Russian National Corpus, SynTagRus, OpenCorpora.org data and the GICR corpus with resolved homonymy, all of which differ in tagsets, lemmatization rules, pipeline architecture, technical solutions and systematic errors. The collections include both normative texts (news and modern literature) and more informal discourse (social media and spoken data). The texts are available under the CC BY-NC-SA 3.0 license.
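To make the target format concrete, the following is a minimal sketch of reading CoNLL-U data in Python. It assumes the standard ten tab-separated columns of the Universal Dependencies format; the sample sentence, its annotation, and the helper names `parse_feats` and `parse_conllu` are illustrative assumptions, not taken from the shared-task data or tooling.

```python
from typing import Dict, List

# The ten columns defined by the CoNLL-U format, in order.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def parse_feats(raw: str) -> Dict[str, str]:
    """Turn 'Case=Nom|Number=Sing' into a dict; '_' means no features."""
    if raw == "_":
        return {}
    return dict(pair.split("=", 1) for pair in raw.split("|"))


def parse_conllu(text: str) -> List[List[dict]]:
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():          # a blank line closes a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):      # sentence-level comment or metadata
            continue
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        token = dict(zip(CONLLU_FIELDS, cols))
        token["feats"] = parse_feats(token["feats"])
        current.append(token)
    if current:                       # the last sentence may lack a trailing blank line
        sentences.append(current)
    return sentences


# Hypothetical sample, annotated by hand for illustration only.
SAMPLE = (
    "# text = Мама мыла раму\n"
    "1\tМама\tмама\tNOUN\t_\tAnimacy=Anim|Case=Nom|Gender=Fem|Number=Sing\t2\tnsubj\t_\t_\n"
    "2\tмыла\tмыть\tVERB\t_\tAspect=Imp|Gender=Fem|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Act\t0\troot\t_\t_\n"
    "3\tраму\tрама\tNOUN\t_\tAnimacy=Inan|Case=Acc|Gender=Fem|Number=Sing\t2\tobj\t_\t_\n"
)

if __name__ == "__main__":
    for tok in parse_conllu(SAMPLE)[0]:
        print(tok["form"], tok["lemma"], tok["upos"], tok["feats"].get("Case"))
```

Keeping the features as a dictionary makes it easier to compare annotations across corpora whose original tagsets order or name their categories differently.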
This paper concerns discourse-new mention detection in Russian. Detecting the mention of an entity newly introduced into discourse may be helpful for various NLP applications, such as coreference resolution, protagonist identification, summarization and different information extraction tasks. We deal with Russian, where there are no grammatical devices, like articles in English, for overtly marking a newly introduced referent. Our aim is to check the impact of various features on this task. The focus is on the specific devices for introducing a new discourse-prominent referent in Russian that have been described in theoretical studies. We conduct a pilot study of feature impact and run a series of experiments on detecting the first mention of a referent in a non-singleton coreference chain, drawing on linguistic insights about the structural, morphological and lexical features that accompany the introduction of a prominent entity into discourse.
Welcome to the 12th edition of LREC . . . that should have been in Marseille, first time in France! Unfortunately not now, in May 2020. Now my welcome is completely virtual, to all of you authors of these Proceedings papers and to the colleagues who will look at these. Virtual but not less sincere. This LREC would have also been an occasion to celebrate the 25th anniversary of ELRA. We are proud that ELRA is becoming a mature association. And LREC too. LREC started in 1998, 22 years ago. We hope to welcome you in a non-virtual way next year in Marseille. We will enjoy together not only the conference but also the special “light” of Marseille and the wonderful view of the Mediterranean and the city from the Palais du Pharo.
Many NLP researchers, especially those not working in the area of discourse processing, tend to equate coreference resolution with the sort of coreference that people did in MUC, ACE, and OntoNotes, and have the impression that coreference is a well-worn task, owing in part to the large number of papers reporting results on the MUC/ACE/OntoNotes corpora. Given the plethora of work on entity coreference, and aware of other fora gathering coreference-related papers (such as LAW, DiscoMT or EVENTS), we believed that the time was ripe for a new workshop on the single topic of coreference resolution. The workshop would bring together researchers interested in under-investigated coreference phenomena and willing to contribute both theoretical and applied computational work on coreference resolution, especially for languages other than English, less-researched forms of coreference and new applications of coreference resolution.
The paper concerns discourse-new referent detection. The task of coreference resolution is essential in many text-mining applications. The focus in this task is to detect noun phrases (NPs) that refer to the same entity. In languages without articles, an NP carries no overt grammatical clues as to whether it introduces a new referent into discourse or refers to a previously mentioned entity. However, some theoretical studies claim that NPs mentioning a referent for the first time have specific features. In our research, we examine features that serve as discourse-new detectors for NPs corresponding to discourse-salient referents and run an experiment on the contribution of different features to this detection. First-mention detection could improve the quality of coreference resolution systems.
This paper describes the application of the well-known “transformation-based learning” algorithm for automatic rule generation to the task of part-of-speech tagging. The algorithm is applied to corpora of annotated Russian texts, and the accuracy as well as the most significant rules are reported.
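For readers unfamiliar with the technique, here is a minimal sketch of transformation-based learning in the style of Brill's tagger. The abstract does not say which rule templates or corpora were used, so everything here is an assumption made for illustration: the single template ("change tag A to B when the previous tag is C"), the function names, the three-sentence toy corpus and its tags.

```python
from collections import Counter, defaultdict
from typing import Callable, List, Tuple

Sentence = List[Tuple[str, str]]   # [(word, gold_tag), ...]
Rule = Tuple[str, str, str]        # (from_tag, to_tag, previous_tag)


def baseline_tagger(train: List[Sentence], default: str = "NOUN") -> Callable:
    """Initial state: tag each word with its most frequent training tag."""
    counts = defaultdict(Counter)
    for sent in train:
        for word, tag in sent:
            counts[word][tag] += 1
    lexicon = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    return lambda words: [lexicon.get(w, default) for w in words]


def apply_rule(tags: List[str], rule: Rule) -> List[str]:
    """Apply one rule; contexts are read from the input tags, not the output."""
    src, dst, prev = rule
    out = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == src and tags[i - 1] == prev:
            out[i] = dst
    return out


def learn_rules(train: List[Sentence], max_rules: int = 10, min_gain: int = 1):
    """Greedily pick the rule with the largest net error reduction."""
    tagger = baseline_tagger(train)
    gold = [[t for _, t in s] for s in train]
    current = [tagger([w for w, _ in s]) for s in train]
    rules: List[Rule] = []
    for _ in range(max_rules):
        gains = Counter()
        # Each current error proposes the rule that would fix it.
        for cur, ref in zip(current, gold):
            for i in range(1, len(cur)):
                if cur[i] != ref[i]:
                    gains[(cur[i], ref[i], cur[i - 1])] += 1
        # Penalize each candidate for the correct tags it would break.
        for cur, ref in zip(current, gold):
            for i in range(1, len(cur)):
                if cur[i] == ref[i]:
                    for rule in list(gains):
                        if rule[0] == cur[i] and rule[2] == cur[i - 1]:
                            gains[rule] -= 1
        if not gains:
            break
        best, gain = max(gains.items(), key=lambda kv: kv[1])
        if gain < min_gain:
            break
        rules.append(best)
        current = [apply_rule(c, best) for c in current]
    return tagger, rules


def tag(words: List[str], tagger: Callable, rules: List[Rule]) -> List[str]:
    """Tag with the baseline, then apply the learned rules in order."""
    tags = tagger(words)
    for rule in rules:
        tags = apply_rule(tags, rule)
    return tags


# Toy corpus (hypothetical): "печь" is a noun ("stove") or a verb ("to bake").
TRAIN = [
    [("старая", "ADJ"), ("печь", "NOUN")],
    [("новая", "ADJ"), ("печь", "NOUN")],
    [("она", "PRON"), ("будет", "AUX"), ("печь", "VERB")],
]

if __name__ == "__main__":
    tagger, rules = learn_rules(TRAIN)
    print(rules)
    print(tag(["она", "будет", "печь"], tagger, rules))
```

On this toy corpus the baseline tags every "печь" as a noun, and the learner recovers one rule that retags it as a verb after an auxiliary. A realistic setup would use many more templates (neighbouring words, tags two positions away, suffixes) and a much larger annotated corpus.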