Building a learner corpus for Russian
In this paper we describe an open learner corpus of Russian. The Russian Learner Corpus (RLC) is the first corpus with a clear distinction between foreign language learners and heritage speakers. We discuss the structure of the corpus, its development and its annotation principles. We also describe the RLC platform, which combines online tools for text uploading, processing, error annotation and corpus search.
The article gives an overview of mistakes made by a particular type of speaker: children of emigrants from Russia who grew up in a foreign linguistic environment and inherited their Russian from their parents. The English-language tradition refers to this variety of Russian as heritage Russian. The study is based on data from the Russian Learner Corpus, which includes texts produced by children of emigrants to the USA. The results show that the mistakes made by this type of speaker differ from those made by both ordinary native speakers of Russian and L2 students, and that the process by which they emerge is of significant linguistic interest.
The conference was organised under the aegis of the Learner Corpus Association and was hosted by Eurac Research Institute for Applied Linguistics. It was themed "Widening the scope of learner corpus research" and brought together researchers and language teachers, software developers and linguists from 23 countries around the world.
The project we present, the Russian Learner Translator Corpus (RusLTC), is a multiple learner translator corpus that stores Russian students’ translations from English and into English. The project is being developed by a cross-functional team of translator trainers and computational linguists in Russia. Translations are collected from several Russian universities; all translations are made as part of routine and exam assignments or as submissions for translation contests by students majoring in translation. As of March 2014 RusLTC contains a total of nearly 1.2 million word tokens, 258 source texts, and 1,795 translations. The paper gives a brief overview of related research and describes the corpus structure and the corpus-building technologies used; it also covers the query tool features and our error annotation solutions. In the final part we give a summary of RusLTC-based research and its current practical applications, and suggest research prospects and possibilities.
The paper discusses case (non-)coincidence in elliptical coordinated constructions, which is one of the most widespread types of errors that native speakers of Russian make.
The workshop series on Natural Language Processing (NLP) for Computer-Assisted Language Learning (CALL) – NLP4CALL – is a meeting place for researchers working on the integration of Natural Language Processing and Speech Technologies in CALL systems and exploring the theoretical and methodological issues arising in this connection.
The paper describes a learner corpus composed of English essays written by native Russian speakers. REALEC (Russian Error-Annotated Learner English Corpus) is an error-annotated corpus available online, now containing more than 200 thousand word tokens in almost 800 essays. It is one of the first Russian ESL corpora and is under active development, growing both in size and in the features offered to users. We describe our perspective on the corpus and the data sources and tools used in compiling it. A detailed custom classification of learners’ error types is described in full. The paper also presents a pilot experiment on creating test sets for particular learners’ problems using corpus data.
What is the language distribution among migrant children in different domains? Which factors influence the relationship between the majority and dominant language? Do second-generation migrants experience problems with linguistic shift? The work also considers data from schoolchildren surveyed by the Sociology of Education and Science Laboratory at the National Research University Higher School of Economics in 2009-2010 (around 7,500 surveys of high school pupils, continuous sampling in schools).
Second language (L2) speakers often experience difficulty discriminating speech sounds of the nonnative language, which can result in phonolexical ambiguity. We report two experiments that examine how L2 Russian speakers may utilize contextual constraints for phonolexical ambiguity resolution during speech comprehension. L2 ambiguous words constitute minimal pairs with palatalized and unpalatalized consonants in the Russian language, where the phonological feature of palatalization marks semantic, morphological, or syntactic distinctions between words. L2 performance is compared to that of a control group of Russian native speakers. The results demonstrate that L2 listeners rely on contextual information for meaning disambiguation during sentence comprehension, but that the relative reliance on different types of context is task specific.