Automatic dependency parsing of a learner English corpus REALEC
Many reports have described advances in the development of learner corpora that make it possible to use these text collections effectively for the benefit of the learning process. This paper lists the possible applications, in English courses taught to Bachelor students, of REALEC, a medium-sized learner corpus that comprises students’ written works supplied with expert annotation of mistakes, browsing and search options, and an optional automated tagging system. Annotation in the corpus is provided either by experts (mostly EFL instructors) or by learners themselves under the supervision of their EFL instructors. First, the paper argues that when EFL methodology requires students to apply the error classification while annotating their peers’ essays, and gradually their own essays as well, their understanding of subtle areas of grammar, vocabulary and discourse improves and, correspondingly, the number of errors in their written works decreases. The second argument concerns a tool for developing placement and progress tests that makes use of sentences with mistakes made by other learners, the contributors to the corpus. In the suggested test design, sentences are automatically extracted from the same corpus, manually divided into three levels according to the complexity of the change required to correct the mistake, and then administered to learners as an automated measure of their proficiency in English. The submitted test is scored automatically within minutes. The third possibility considered in the research is to supplement the corpus with a platform of training exercises set up automatically or semi-automatically on the basis of frequently marked errors made by a particular group of students. In conclusion, we point out the ease and usefulness of the proposed applications for both EFL instructors and learners of English.
The paper describes the preparation and development of the text collections within the framework of the MorphoRuEval-2017 shared task, an evaluation campaign designed to stimulate the development of automatic morphological processing technologies for Russian. The main challenge for the organizers was to convert all available Russian corpora with manually verified, high-quality tagging to a single format (Universal Dependencies CoNLL-U). The data sources were the disambiguated subcorpus of the Russian National Corpus, SynTagRus, OpenCorpora.org data and the GICR corpus with resolved homonymy, all of which differ in tagsets, lemmatization rules, pipeline architecture, technical solutions and the systematicity of errors. The collections include both normative texts (news and modern literature) and more informal discourse (social media and spoken data). The texts are available under the CC BY-NC-SA 3.0 license.
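To make the target format concrete, the following is a minimal sketch of a CoNLL-U reader in Python. It is not the organizers' conversion pipeline. The function names `read_conllu` and `parse_feats` are hypothetical, and the sample sentence and its annotation are invented for illustration rather than taken from any of the corpora named above.

```python
from typing import Dict, Iterator, List

# The ten tab-separated CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of token dicts keyed by FIELDS."""
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():              # a blank line ends a sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):          # sentence-level comment line
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        sentence.append(dict(zip(FIELDS, cols)))
    if sentence:                          # input without a trailing blank line
        yield sentence


def parse_feats(feats: str) -> Dict[str, str]:
    """Turn 'Case=Nom|Number=Sing' into a dict; '_' means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# Invented example sentence (assumption: illustrative annotation only).
SAMPLE = (
    "# text = Мама мыла раму\n"
    "1\tМама\tмама\tNOUN\t_\tCase=Nom|Number=Sing\t2\tnsubj\t_\t_\n"
    "2\tмыла\tмыть\tVERB\t_\tTense=Past|VerbForm=Fin\t0\troot\t_\t_\n"
    "3\tраму\tрама\tNOUN\t_\tCase=Acc|Number=Sing\t2\tobj\t_\t_\n"
    "\n"
)

if __name__ == "__main__":
    for sent in read_conllu(SAMPLE):
        for tok in sent:
            print(tok["form"], tok["lemma"], tok["upos"], parse_feats(tok["feats"]))
```

Keeping the features as a dictionary makes differences between source tagsets easier to inspect, since each corpus can be compared attribute by attribute once it is in the common format.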
Various issues relating to learner corpus research and its use in teaching are presented. These include the issue of the norm in corpora: whether the norm should necessarily be native, and what problems a native norm may present. Learners who behave differently from native speakers do not necessarily use language incorrectly. As an alternative to a unique, native norm, a range of norms is available. Some of these norms may be problematic if they are not selected carefully (depending on the learner corpus, the purpose of the comparison, etc.) and handled cautiously. Different choices of norms may produce different results and thus lead to different conclusions with respect to learners’ usage. Pedagogical implications of such choices are to be examined, with particular emphasis on whether all differences between the learner corpus and the reference corpus should be targeted for teaching intervention. Problems in evaluating agreement across approaches to annotation are considered as well.
The Corpus of Russian Student Texts (CoRST) is a computational and research project started in 2013 at the Linguistic Laboratory for Corpora Research Technologies at HSE. It comprises a collection of Russian texts written by students from various Russian universities. Its main research goal is to examine language deviations viewed as markers of language change. CoRST is supplied with metalinguistic, morphological and error annotation, which makes it possible to build custom subcorpora and to search by various error types. Its error annotation is based on a modular classification (lexis, grammar and discourse), within which the most frequent error phenomena are further distinguished. In total, the error classification encompasses 39 error tags (20 higher-level and 19 lower-level). The crucial characteristic of CoRST is that the error annotation is multi-layered. Typically, since an erroneous section can be corrected in several ways, it is annotated with several error tags accordingly. Moreover, the corpus provides search by two possible explanatory factors: typos and construction blending. The prospects for CoRST development have both computational and research aspects, including qualitative and statistical comparative analysis of language phenomena in CoRST and NRC.
The paper is dedicated to the Universal Dependencies (UD) initiative, which aims to develop a cross-linguistically consistent annotation scheme for grammatical analysis. The purpose of the initiative is to simplify cross-language research, unify interlanguage linguistic typology, and build a foundation for automated multilingual systems and a universal cross-language text parser.
In the first part of the paper we describe the main problems of grammatical analysis of multilingual text, the advantages of unifying language features, and the goals of the Universal Dependencies project. We also give a brief history of the project. Using three languages (Russian, English and German) as examples, we discuss the basic principles of Universal Dependencies, such as its morphological and syntactic features.
In the second part of the article, using predicatives as an example, we illustrate how to conduct corpus research using UD. The article defines a technique for the automatic identification of predicatives and examines their frequency distribution in the Russian UD corpus, as well as a semantic categorization of the most frequently used predicatives.
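A frequency study of this kind can be sketched in Python as follows. The abstract does not state the identification technique, so the rule in `is_candidate` (a sentence root tagged ADV or ADJ) is a placeholder assumption and not the article's method. The helper names and the three sample sentences are likewise hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, Iterator

Token = Dict[str, str]


def tokens_from_conllu(text: str) -> Iterator[Token]:
    """Yield lemma, UPOS and dependency relation for each ordinary token."""
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip malformed rows, multiword ranges ("1-2") and empty nodes ("1.1").
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        yield {"lemma": cols[2], "upos": cols[3], "deprel": cols[7]}


def is_candidate(tok: Token) -> bool:
    # ASSUMPTION: placeholder heuristic, not the technique defined in the article.
    return tok["deprel"] == "root" and tok["upos"] in {"ADV", "ADJ"}


def lemma_frequencies(text: str,
                      keep: Callable[[Token], bool] = is_candidate) -> Counter:
    """Count lowercased lemmas of the tokens accepted by `keep`."""
    return Counter(tok["lemma"].lower()
                   for tok in tokens_from_conllu(text) if keep(tok))


# Invented sample (assumption: illustrative annotation only).
SAMPLE = (
    "1\tМожно\tможно\tADV\t_\t_\t0\troot\t_\t_\n"
    "2\tидти\tидти\tVERB\t_\t_\t1\tcsubj\t_\t_\n"
    "\n"
    "1\tНадо\tнадо\tADV\t_\t_\t0\troot\t_\t_\n"
    "2\tработать\tработать\tVERB\t_\t_\t1\tcsubj\t_\t_\n"
    "\n"
    "1\tМожно\tможно\tADV\t_\t_\t0\troot\t_\t_\n"
    "2\tспать\tспать\tVERB\t_\t_\t1\tcsubj\t_\t_\n"
    "\n"
)

if __name__ == "__main__":
    for lemma, count in lemma_frequencies(SAMPLE).most_common():
        print(lemma, count)
```

Passing the selection rule as the `keep` argument lets a different identification criterion be substituted without changing the counting code.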
The present article continues the investigation of the Soqotri verbal system undertaken by the Russian-Soqotri fieldwork team. The article focuses on the so-called “weak” and “geminated” roots in the basic stem. The investigation is based on the analysis of full paradigms (perfect, imperfect and jussive) of more than 170 “weak” and “geminated” Soqotri verbs.