Automatic dependency parsing of a learner English corpus REALEC
MorphoRuEval-2017 is an evaluation campaign designed to stimulate the development of automatic morphological processing technologies for Russian, both for normative texts (news, fiction, nonfiction) and for those of a less formal nature (blogs and other social media). This article compares the methods participants used to solve the task of morphological analysis. It also discusses the problem of unifying the various existing training collections for the Russian language.
This paper focuses on referential coherence, which is seen as a crucial attribute of effective academic writing. I report findings from a corpus study of Russian students' use of anaphoric expressions in their research proposals, which are compared to a reference corpus comprising research articles published in peer-reviewed journals. I hypothesise that learners use anaphora less frequently than professional writers. The results of the analysis confirmed the hypothesis and allowed me to identify particular problems connected with the students' use of anaphoric expressions. It is hoped that the reported findings will challenge EAP teachers and textbook writers to consider paying closer attention to the markers of referential coherence in a course of academic writing for L2 students.
Various issues relating to learner corpus research and its use in teaching are presented. These include the issue of a norm in corpora: whether the norm should necessarily be native, and what problems a native norm may present. Learners who behave differently from native speakers do not necessarily use language incorrectly. As an alternative to a unique, native norm, a range of norms is available. Some of these norms may be problematic if they are not selected carefully (depending on the learner corpus, the purpose of the comparison, etc.) and handled cautiously. Different choices of norm may produce different results and thus lead to different conclusions with respect to learners’ usage. Pedagogical implications of such choices are to be examined, with particular emphasis on whether all differences between the learner corpus and the reference corpus should be targeted for teaching intervention. Problems in evaluating agreement between approaches to annotation are considered as well.
The scope and the level of change suggested by an annotator cannot be formally defined, and besides, two people, whether native speakers or fluent speakers of a foreign language, will rarely agree fully in their intuitive perception of what is acceptable in the language. However, if annotators, first, restrict corrections to those they find absolutely necessary to stay within the norm and, second, tag only the core change in the chosen correction rather than every word that has to change as a result of it, the variation across annotators is bound to decrease dramatically. Both requirements, accompanied by examples from the corpus, are to be included in the REALEC Annotation Manual, and training based on complicated cases from the experiment described above will be given to all the annotators.
We have performed an analysis of problematic cases of annotator inconsistency to reveal the weaknesses and strengths of the annotation scheme.
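One common way to quantify annotator inconsistency of this kind is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below is only an illustration and is not taken from the abstracts above: the function name `cohen_kappa` and the error tag names in the usage example are hypothetical, and it assumes two annotators have each assigned exactly one tag to the same sequence of error spans.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who tagged the same items.

    Assumption: labels_a[i] and labels_b[i] are the single tags the two
    annotators gave to the same error span i.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")

    n = len(labels_a)
    # Observed agreement: share of items where both annotators chose the same tag.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability that both pick the same tag independently,
    # estimated from each annotator's own tag distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[tag] * freq_b[tag] for tag in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one and the same tag everywhere.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical tags for four error spans (not the real REALEC tag set).
annotator_1 = ["Spelling", "Articles", "Tense", "Articles"]
annotator_2 = ["Spelling", "Articles", "Articles", "Articles"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Kappa of 1 means perfect agreement and values near 0 mean agreement no better than chance. Restricting tags to the core change, as proposed above, would be expected to raise this figure, since each span then has fewer competing tags.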
The paper examines construction blending as an important cause of errors in students’ written texts. The study is conducted within the framework of Construction Grammar [Fillmore and Kay 1992; Goldberg 1995, 2006] and grammar of errors [Vyrenkova et al. 2014]. It is based on data from the Corpus of Russian Student Texts, which is supplied with metatextual, morphological and error annotation.
The Corpus of Russian Student Texts (CoRST) is a computational and research project started in 2013 at the Linguistic Laboratory for Corpora Research Technologies at HSE. It comprises a collection of Russian texts written by students from various Russian universities. Its main research goal is to examine language deviations viewed as markers of language change. CoRST is supplied with metalinguistic, morphological and error annotation, which makes it possible to customize subcorpora and search by error type. Its error annotation is based on a modular classification into lexis, grammar and discourse, within which the most frequent error phenomena are further distinguished. In total, the error classification encompasses 39 error tags (20 higher-level and 19 lower-level). The crucial characteristic of CoRST is that the error annotation is multi-layered: since an erroneous fragment can typically be corrected in several ways, it is annotated with several error tags accordingly. Moreover, the corpus supports search by two possible explanatory factors: typo and construction blending. The prospects for CoRST development have both computational and research aspects, including qualitative and statistical comparative analysis of language phenomena in CoRST and NRC.
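The multi-layered annotation described above can be pictured as a span that carries several error tags plus optional explanatory factors. The following is a minimal sketch under stated assumptions, not the actual CoRST data model: the class `ErrorSpan`, the function `find_spans`, and the tag names `lex`, `gram` and `disc` are all hypothetical, while the two factor names mirror the ones mentioned in the abstract.

```python
from dataclasses import dataclass, field


@dataclass
class ErrorSpan:
    """One annotated fragment; all names here are illustrative assumptions."""
    start: int                                  # character offset where the fragment begins
    end: int                                    # character offset where the fragment ends
    tags: list = field(default_factory=list)    # one tag per possible correction
    factors: set = field(default_factory=set)   # e.g. {"typo"} or {"construction_blending"}


def find_spans(spans, tag=None, factor=None):
    """Return spans that carry the given tag and/or explanatory factor."""
    return [
        s for s in spans
        if (tag is None or tag in s.tags)
        and (factor is None or factor in s.factors)
    ]


# Toy data: the second span can be corrected in two ways, so it has two tags.
spans = [
    ErrorSpan(0, 12, tags=["lex"]),
    ErrorSpan(20, 41, tags=["gram", "lex"], factors={"construction_blending"}),
    ErrorSpan(50, 55, tags=["gram"], factors={"typo"}),
]
grammar_spans = find_spans(spans, tag="gram")
blends = find_spans(spans, factor="construction_blending")
```

Keeping the tags as a list on each span, rather than one tag per span, is what lets a single fragment be found under every correction it admits.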
The article examines the main trends in the study of the Stalinist period and the phenomenon of Stalinism in connection with the mass opening of the archives.