Automatic dependency parsing of a learner English corpus REALEC
In this article I present a connection between the frequency and length of person-number indexes, drawing on two independent studies: token frequency obtained from the Universal Dependencies treebanks and type frequency gathered in a typological study. After introducing the results of those two studies, I present East Caucasian data. I show that the unusual history of person-number indexes in these languages leads to violations of these tendencies.
The paper describes the preparation and development of the text collections for the MorphoRuEval-2017 shared task, an evaluation campaign designed to stimulate the development of automatic morphological processing technologies for Russian. The main challenge for the organizers was to convert all available Russian corpora with manually verified, high-quality tagging to a single format (Universal Dependencies CoNLL-U). The data sources were the disambiguated subcorpus of the Russian National Corpus, SynTagRus, OpenCorpora.org data and the GICR corpus with resolved homonymy, all of which differ in tagsets, lemmatization rules, pipeline architecture, technical solutions and systematic errors. The collections include both normative texts (news and modern literature) and more informal discourse (social media and spoken data). The texts are available under the CC BY-NC-SA 3.0 license.
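As an illustration of the target format, CoNLL-U stores one token per line in ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), with `#` comment lines and a blank line after each sentence. The following is a minimal reading sketch, not the organizers' conversion pipeline; the names `Token`, `parse_feats`, `parse_conllu` and the sample sentence are assumptions introduced here.

```python
from typing import Dict, List, NamedTuple


class Token(NamedTuple):
    """One CoNLL-U token line (hypothetical helper type, not from the paper)."""
    id: str            # kept as a string: ranges such as "1-2" are legal IDs
    form: str
    lemma: str
    upos: str
    xpos: str
    feats: Dict[str, str]
    head: str
    deprel: str
    deps: str
    misc: str


def parse_feats(raw: str) -> Dict[str, str]:
    """Turn 'Case=Nom|Number=Sing' into a dict; '_' means no features."""
    if raw == "_":
        return {}
    return dict(pair.split("=", 1) for pair in raw.split("|"))


def parse_conllu(text: str) -> List[List[Token]]:
    """Split CoNLL-U text into sentences, each a list of Token records."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        if not line.strip():          # blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):      # sentence-level comment / metadata
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        current.append(Token(cols[0], cols[1], cols[2], cols[3], cols[4],
                             parse_feats(cols[5]), cols[6], cols[7],
                             cols[8], cols[9]))
    if current:                       # input without a trailing blank line
        sentences.append(current)
    return sentences


# Illustrative sample sentence (an assumption, not taken from the shared task data).
SAMPLE = (
    "# text = Мама мыла раму\n"
    "1\tМама\tмама\tNOUN\t_\tCase=Nom|Number=Sing\t2\tnsubj\t_\t_\n"
    "2\tмыла\tмыть\tVERB\t_\tTense=Past\t0\troot\t_\t_\n"
    "3\tраму\tрама\tNOUN\t_\tCase=Acc|Number=Sing\t2\tobj\t_\t_\n"
    "\n"
)

if __name__ == "__main__":
    for sentence in parse_conllu(SAMPLE):
        for token in sentence:
            print(token.form, token.upos, token.feats)
```

Keeping `id` and `head` as strings avoids losing multiword-token ranges; a stricter reader could validate them against the full CoNLL-U specification.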
The paper discusses efforts to create a morphological standard for the Middle Russian corpus, which is part of the historical collection of the Russian National Corpus (RNC). To meet the needs of different categories of corpus researchers as well as NLP developers, we consider two styles of morphological annotation (the RNC schema and the Universal Dependencies schema). A number of specifications of the feature list are proposed to facilitate data reusability, linking and conversion.
Various issues relating to learner corpus research and its use in teaching are presented. These include the issue of a norm in corpora: whether the norm should necessarily be native, and what problems a native norm may present. Learners who behave differently from native speakers do not necessarily use language incorrectly. As an alternative to a unique, native norm, a range of norms is available. Some of these norms may be problematic if they are not selected carefully (depending on the learner corpus, the purpose of the comparison, etc.) and handled cautiously. Different choices of norms may produce different results and thus lead to different conclusions with respect to learners’ usage. Pedagogical implications of such choices are examined, with particular emphasis on whether all differences between the learner corpus and the reference corpus should be targeted for teaching intervention. Problems in evaluating agreement across approaches to annotation are considered as well.
The Corpus of Russian Student Texts (CoRST) is a computational and research project started in 2013 at the Linguistic Laboratory for Corpora Research Technologies at HSE. It comprises a collection of Russian texts written by students from various Russian universities. Its main research goal is to examine language deviations viewed as markers of language change. CoRST is supplied with metalinguistic, morphological and error annotation, which makes it possible to customize subcorpora and to search by various error types. Its error annotation is based on a modular classification (lexis, grammar and discourse), within which the most frequent error phenomena are further distinguished. In total, the error classification encompasses 39 error tags (20 higher-level and 19 lower-level). The crucial characteristic of CoRST is that the error annotation is multi-layered: since an erroneous span can typically be corrected in several ways, it is annotated with several error tags accordingly. Moreover, the corpus provides search by two possible explanatory factors: typo and construction blending. The prospects for CoRST development have both computational and research aspects, including qualitative and statistical comparative analysis of language phenomena in CoRST and NRC.
The paper is dedicated to the Universal Dependencies (UD) initiative, which aims to develop a cross-linguistically consistent annotation scheme for grammatical analysis. The purpose of this initiative is to simplify cross-language research, to unify interlanguage linguistic typology, and to build a foundation for automated multilingual systems and a universal cross-language text parser.
In the first part of the paper we describe the main problems of grammatical analysis of multilingual text, the advantages of unifying language features, and the purposes of the Universal Dependencies project. We also give a brief history of the project's creation. Using three languages (Russian, English and German) as examples, we discuss the basic principles of Universal Dependencies, such as its morphological and syntactic features.
In the second part of the article, using predicatives as an example, we illustrate how to conduct corpus research with UD. The article defines a technique for the automatic identification of predicatives and examines their frequency distribution in the Russian UD corpus, together with a semantic categorization of the most frequently used predicatives.
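A frequency distribution of this kind can be sketched as a lemma count over tokens that pass an identification test. The abstract does not state the article's actual criterion, so the predicate `is_predicative_candidate` below is a placeholder assumption, as are the function names and the toy token list; the sketch only shows the counting step.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Each token is reduced to (lemma, UPOS tag, dependency relation).
TokenInfo = Tuple[str, str, str]


def is_predicative_candidate(token: TokenInfo) -> bool:
    """PLACEHOLDER criterion (an assumption, not the article's technique):
    treat an adverb-tagged token heading its clause as a candidate."""
    _lemma, upos, deprel = token
    return upos == "ADV" and deprel == "root"


def predicative_frequencies(tokens: Iterable[TokenInfo]) -> Counter:
    """Count lemmas of all tokens accepted by the placeholder criterion."""
    return Counter(lemma for lemma, upos, deprel in tokens
                   if is_predicative_candidate((lemma, upos, deprel)))


# Toy data invented for illustration; not drawn from the Russian UD corpus.
TOY_TOKENS: List[TokenInfo] = [
    ("можно", "ADV", "root"),
    ("идти", "VERB", "csubj"),
    ("нужно", "ADV", "root"),
    ("можно", "ADV", "root"),
    ("быстро", "ADV", "advmod"),
]

if __name__ == "__main__":
    for lemma, count in predicative_frequencies(TOY_TOKENS).most_common():
        print(lemma, count)
```

Swapping in a different identification rule only requires replacing the predicate; the counting and ranking with `Counter.most_common` stay the same.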
In this paper we consider choice problems under the assumption that the preferences of the decision maker are expressed in the form of a parametric partial weak order without assuming the existence of any value function. We investigate both the sensitivity (stability) of each non-dominated solution with respect to the changes of parameters of this order, and the sensitivity of the set of non-dominated solutions as a whole to similar changes. We show that this type of sensitivity analysis can be performed by employing techniques of linear programming.