Approaches to selection of features for automated feedback for student writing
Access to a learner corpus has been shown to increase the efficiency of L2 acquisition for learners as well as teaching efficiency for EFL instructors. This paper presents a computer tool for a learner corpus designed at the School of Linguistics of the Higher School of Economics for both categories of users. REALEC, the Russian Error-Annotated Learner English Corpus, set up at the School of Linguistics, is the first openly accessible collection of English texts written by Russian students learning English. All errors made by Russian students in their academic writing in English are pointed out to them with special tags by expert annotators (EFL instructors, as a rule). The annotation process is supervised by the research team, which is responsible for consistency in tagging as well as for the development of the learner corpus. One direction of this development is the study of the lexical features used in student essays. Our approach in this research was to identify lexical features of the essays scored highly by experts that differ significantly from the same features in the essays given the lowest grades.
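As a rough illustration of the comparison described above, the sketch below computes one lexical feature, the type-token ratio, for each essay and uses a permutation test to check whether its mean differs between two groups of essays. The choice of feature, the function names, and the toy values are assumptions made for illustration; the abstract does not state which features or tests were actually used.

```python
import random
import re


def type_token_ratio(text):
    """Share of distinct word forms among all word tokens in a text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def permutation_p_value(high, low, n_iter=5000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(sum(high) / len(high) - sum(low) / len(low))
    pooled = list(high) + list(low)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        a, b = pooled[:len(high)], pooled[len(high):]
        # Small tolerance guards against floating-point noise in ties.
        if abs(sum(a) / len(a) - sum(b) / len(b)) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_iter + 1)


# Hypothetical feature values for highly and poorly scored essays.
high_scores = [0.9, 0.9, 0.9, 0.9, 0.9]
low_scores = [0.1, 0.1, 0.1, 0.1, 0.1]
p_value = permutation_p_value(high_scores, low_scores)
```

In practice each value would come from `type_token_ratio` applied to a real essay, and a feature would be kept as a candidate for feedback only if the group difference is both statistically significant and large enough to matter.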
Corpus linguistics can be broadly defined in terms of two partially overlapping research dimensions. On the one hand, corpus linguistics is knowledge of how to compile and annotate linguistic corpora. On the other hand, corpus linguistics is a family of qualitative and quantitative methods of language study based on corpus data. The book presents the first steps taken by Russian corpus linguistics toward the development of language corpora and corpus-based resources as well as their use in grammatical and lexical analysis.
The first part of the book focuses on the annotation of Russian texts at several levels: lemmas, part of speech and inflectional forms, word formation, lexical-semantic classes, syntactic dependencies, semantic roles, frames, and lexical constructions. We discuss various theoretical principles and practical considerations motivating the corpus markup design, provide details on the creation of lexical resources (electronic dictionaries and databases) and text processing software, and consider complicated cases that present challenges for the annotation of corpora both manually and automatically. In most cases we describe the annotation of the Russian National Corpus (RNC, ruscorpora.ru) and its affiliate project FrameBank (framebank.ru).
Frequency data depend not only on the representativeness and balance of texts in a corpus, but also on the rules and tools used for annotation. The book addresses the development of evaluation standards for Russian NLP resources, namely, morphological taggers and dependency parsers. In addition, the book presents several experiments on automatic annotation and disambiguation: lemmatization of word forms not in the dictionary; word sense disambiguation based on vectors formed by lexical, semantic and grammatical cues of context; and semantic role labeling.
The final chapters of the first part of the book outline two types of frequency dictionaries based on the RNC data: a general-purpose frequency dictionary and a lexico-grammatical one.
The second part of the book presents an analysis of corpus data and includes a number of case studies of Russian grammar and lexical-grammatical interaction using quantitative methods. The key concept underlying our analysis is the behavioral profile (Hanks 1996; Divjak, Gries 2006), which is the frequency distribution of variable elements in a linguistic unit as attested in a corpus. This covers grammatical profiles (the frequency distribution of inflected forms of a word), constructional profiles (the frequency distribution of argument or any other constructions attested for a key predicate), lexical and semantic profiles (the frequency distribution of words and lexical-semantic classes in construction slots or, more generally, in the context of a word), and radial category profiles (the frequency distribution of word senses and word uses across the radial category network of a polysemous unit). We use grammatical, constructional, semantic, and radial category profiling to study tense, aspect and mood specialization of Russian verb forms; to identify singular-oriented and plural-oriented nouns; to investigate factors for prefix choice and prefix variation in natural perfectives (chistovidovye perfectivy); to analyze constraints on the filling of slots in a construction and how this affects the meaning of the construction, taking as an example the Genitive construction of shape and the spatial construction with the preposition poverkh ‘up and over’.
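A grammatical profile in the sense defined above can be sketched as a relative frequency distribution over the grammatical tags attested for a lemma. The function name, the tag labels, and the toy sample below are assumptions for illustration and are not taken from the RNC.

```python
from collections import Counter


def grammatical_profile(tagged_tokens, lemma):
    """Relative frequency of each grammatical tag attested for a lemma.

    tagged_tokens: iterable of (lemma, tag) pairs from an annotated corpus.
    """
    counts = Counter(tag for lem, tag in tagged_tokens if lem == lemma)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()} if total else {}


# Toy annotated sample (hypothetical, not RNC data).
sample = [
    ("glaz", "pl"), ("glaz", "pl"), ("glaz", "pl"), ("glaz", "sg"),
    ("nos", "sg"), ("nos", "sg"),
]
profile = grammatical_profile(sample, "glaz")
```

On this toy sample the lemma "glaz" occurs in the plural in three of its four attestations, which is the kind of skew that would mark a noun as plural-oriented once it is confirmed on real corpus counts.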
The quantitative corpus-based techniques used for the analysis range from simple descriptive statistics (e.g., absolute frequencies, percentages, measures of central tendency and outliers) to Fisher's exact test and logistic regression. We claim that vector modeling approaches to quantitative grammatical studies are no less effective in theoretical linguistics than in computational linguistics, where they have become a standard tool.
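For readers unfamiliar with Fisher's exact test, the following is a minimal sketch of the two-sided version for a 2x2 contingency table, written with the standard library only. The function name and the example counts are hypothetical; the book's own computations may have used different software and settings.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as unlikely as the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


# Hypothetical counts: rows are two nouns, columns are singular vs plural uses.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
```

The test is exact because it enumerates every table compatible with the observed margins, which makes it suitable for the small cell counts that are common when a profile is split across many categories.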
The paper describes a learner corpus composed of English essays written by native speakers of Russian. REALEC (Russian Error-Annotated Learner English Corpus) is an error-annotated corpus available online, now containing more than 200 thousand word tokens in almost 800 essays. It is one of the first Russian ESL corpora and is being actively developed, growing both in size and in the features offered to users. We describe our perspective on the corpus and the data sources and tools used in compiling it. The custom classification of learner error types is described in detail. The paper also presents a pilot experiment on creating test sets for particular learner problems using corpus data.
The project we present, the Russian Learner Translator Corpus (RusLTC), is a multiple learner translator corpus that stores Russian students' translations from English and into English. The project is being developed by a cross-functional team of translator trainers and computational linguists in Russia. Translations are collected from several Russian universities; all translations are made by students majoring in translation, either as part of routine and exam assignments or as submissions for translation contests. As of March 2014, RusLTC contains a total of nearly 1.2 million word tokens, 258 source texts, and 1,795 translations. The paper gives a brief overview of related research and describes the corpus structure and the corpus-building technologies used; it also covers the query tool features and our error annotation solutions. In the final part we summarize the RusLTC-based research and its current practical applications, and suggest research prospects and possibilities.
The paper discusses case (non-)coincidence in elliptical coordinated constructions, which is one of the most widespread types of errors that native speakers of Russian make.
The workshop series on Natural Language Processing (NLP) for Computer-Assisted Language Learning (CALL) – NLP4CALL – is a meeting place for researchers working on the integration of Natural Language Processing and Speech Technologies in CALL systems and exploring the theoretical and methodological issues arising in this connection.
Drawing on corpus material, the article considers common errors in the use and formation of verbal aspect forms from the theoretical and typological points of view. The data come from the Russian Learner Corpus (RLC), which contains texts written by students of Russian as a foreign language. The study identifies "weak points" in foreign students' mastery of this topic and attempts to build a typology of errors. It is shown that the observed errors in the formation of verbs are generally expected; they are also confirmed by the results of the study. The article analyzes the possibilities of using the Russian Learner Corpus (RUK; Russian Learner Corpus, RLC) in teaching Russian as a foreign language when studying the topic of verbal aspect. A total of 900 examples of verbs are considered; errors in present, past and future tense verbs are noted in 330 of them. The aim of the study is to clarify the derivation rules for verbal aspect and the rules for using the aspects, to analyze the types of errors returned by the learner corpus, and to establish a correlation between rule and usage.
The paper discusses evaluation techniques for semantic role labeling in Russian. It has been shown that the quality of FrameNet-style semantic role labeling largely depends on the number of roles and may decrease if the inventory of roles in the training set differs from that in the output resource. Our study is a first step towards a ‘smart’ evaluation tool that would introduce linguistically relevant criteria into evaluation, place mistakes on a scale from minor to critical, and make evaluation easier when the grid of roles varies.
We run an experiment based on data from the Russian FrameBank, a FrameNet-oriented open-access database that includes a dictionary of Russian lexical constructions and a corpus of tagged examples. The semantic role is one of the parameters that define the predicate-argument patterns in FrameBank. The inventory of roles is modeled hierarchically and forms a graph. We explore the cases in which the role induced by the system and the gold-standard answer do not match. We analyze the statistical criteria of the distribution of roles in the patterns and the distance between the source and the target in the graph of roles as a means of assessing the goodness of fit.
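The idea of grading a labeling mistake by the distance between the predicted role and the gold-standard role in the role graph can be sketched as a shortest-path search. The role names and hierarchy below are a hypothetical fragment invented for illustration, not the actual FrameBank inventory, and the abstract does not specify how the distance was computed.

```python
from collections import deque

# Hypothetical fragment of a role hierarchy (child -> parent); not FrameBank's inventory.
PARENT = {
    "agent": "actor",
    "effector": "actor",
    "patient": "undergoer",
    "theme": "undergoer",
    "actor": "participant",
    "undergoer": "participant",
}


def role_distance(source, target, parent=PARENT):
    """Number of edges on the shortest path between two roles in the hierarchy."""
    # Build an undirected adjacency map from the child -> parent links.
    neighbours = {}
    for child, par in parent.items():
        neighbours.setdefault(child, set()).add(par)
        neighbours.setdefault(par, set()).add(child)
    queue, seen = deque([(source, 0)]), {source}
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the roles are not connected


near_miss = role_distance("agent", "effector")
far_miss = role_distance("agent", "patient")
```

Under this sketch, confusing two sister roles yields a shorter distance than confusing roles from different branches, so the distance could serve as one ingredient of a scale running from minor to critical mistakes.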
The paper studies the reaction of Italian literary critics to the publication of Boris Pasternak's novel "Doctor Jivago". It gives an analysis of the book "'Doctor Jivago', Pasternak, 1958, Italy" (published in Russian in "Reka vremen", Moscow, 2012). The papers of Italian writers, critics and historians of literature who reacted immediately to the publication of the novel (A. Moravia, I. Calvino, F. Fortini, C. Cassola, C. Salinari, etc.) are studied and analyzed.
The article describes the patterns of the realization of emotional utterances in dialogic and monologic speech. The author pays special attention to the characteristic features of the speech of a speaker experiencing psychological tension and to the compositional-pragmatic peculiarities of dialogic and monologic text.