Review of Practices of Collecting and Annotating Texts in the Learner Corpus REALEC

P. 77–88.

REALEC, a learner corpus released in open access, had received 6,054 essays written in English by HSE undergraduate students in their university-level English examinations by 2020. This paper reports on the data collection and manual annotation approaches for the texts of 2014–2019 and discusses the computer tools available for working with the corpus. This work provides the basis for the ongoing development of automated annotation for new portions of learner texts in the corpus. The observations in the first part concern the reliability of the 134,608 error tags manually annotated across the texts in the corpus. Some examples are given to emphasize the role of interference from the learners’ L1 (Russian), which is one more direction for future corpus research. A number of studies carried out by the research team on the basis of REALEC data are listed as examples of the research potential that the corpus provides.

Publication based on the results of:

Automated Detection of Writing Inaccuracies for Students of English in Russia (2021)

In book

Text, Speech, and Dialogue. 25th International Conference, TSD 2022, Brno, Czech Republic, September 6–9, 2022, Proceedings. Lecture Notes in Computer Science (LNAI), vol. 13502. Cham: Springer Publishing Company, 2022.

Correcting or Rewriting? An Expert Evaluation of LLM-Based GEC on Academic Learner Data

Kopylova E. V., Tsegoeva O. G., Berlin V. A. et al., in: Computational Linguistics and Intellectual Technologies: Proceedings of the Annual International Conference “Dialogue”. Issue 24. Moscow: Max Press, 2026. P. 1–10.

This paper investigates how large language models correct complex grammatical errors in Russian academic learner writing. Unlike traditional minimal-edit GEC systems, LLMs often apply generative rewriting strategies that may improve fluency, but risk structural overcorrection and semantic drift. We introduce a new expert benchmark derived from an authentic 3.1M-word learner corpus and construct an evaluation set annotated for ...

Added: June 27, 2026

Syntactic complexity measures as linguistic correlates of proficiency level in learner Russian

Kisselev O., Klimov A., Kopotev M., in: Complexity, Accuracy and Fluency in Learner Corpus Research. Volume vi. Amsterdam: John Benjamins Publishing Company, 2022. Ch. 3. P. 51–80.

The study reports on the results of a corpus-based evaluation of automatically extracted syntactic complexity measures as indices of Russian as a foreign language (FL) and Russian as a heritage language (HL) writing development. A list of 12 syntactic complexity measures was tested on a set of longitudinal, classroom-based data. The analyses demonstrated that the ...

Added: November 25, 2024

Distractor Generation for Lexical Questions Using Learner Corpus Data

Nikita Login, Jazykovedny Casopis 2023 Vol. 74 No. 1 P. 345–356

Learner corpora with error annotation can serve as a source of data for automated question generation (QG) for language testing. In the case of multiple-choice gap-fill lexical questions, this process involves two steps. The first step is to extract sentences with lexical corrections from the learner corpus. The second step, which is the focus of ...

Added: September 16, 2024

Processing of words with frequent spelling errors (a study based on a learner corpus of English)

Klimova M., Viklova A., Overnikova D., Vestnik of Saint Petersburg University. Language and Literature 2023 Vol. 20 No. 4 P. 824–837

The article presents an experimental study of how the frequency of spelling errors in a word influences its representation in the mental lexicon. The hypothesis that frequently misspelled words cause difficulties in reading even if they are written correctly has been proved for native speakers of Russian and English. This paper aims to check ...

Added: January 26, 2024

A spoken learner corpus of Russian as a foreign language: a new data source for linguistic and pedagogical research

Vlasova E., Bets Yu. V., Severina E. M., in: “Russian Grammar in the Dialogue of Scientific Schools, Approaches, and Methods”. Vladivostok: FEFU Publishing House, 2022.

The article analyzes non-trivial phonetic and grammatical phenomena in the oral speech of foreigners learning Russian. It shows that a spoken learner corpus makes it possible to obtain a systematic picture of the compensatory mechanisms of speech production and to test and formulate hypotheses. ...

Added: November 8, 2023

Annotating a learner corpus with a view to its use for research tasks

Klimova M., Viklova A., Overnikova D., in: Modern Linguistics: From Theory to Practice. III Kazan International Linguistic Summit (Kazan, November 14–19, 2022): Proceedings and Materials, in three volumes, vol. 1. Kazan: Kazan University Press, 2022. P. 46–50.

This article examines the error classification used in the learner corpus REALEC in terms of how well it meets the requirements of research tasks and how well suited it is to them. ...

Added: January 17, 2023

Clausal complexity of expert and student writing: a corpus-based analysis of papers in social sciences

Smirnova E. A., Language Learning in Higher Education 2022 Vol. 12 No. 2 P. 453–475

Syntactic complexity has been extensively approached in the fields of corpus linguistics and academic discourse studies. However, works focusing on disciplinary variation in terms of linguistic complexity and comparison of professional and novice academic writing are scarce. Addressing these issues is likely to have important implications for EAP/ESP practitioners in terms of selection of target ...

Added: December 7, 2022

Pragmatic Markers in the Corpus “One Day of Speech”: Approaches to the Annotation

Zaides K., Popova T., Bogdanova-Beglarian N., in: Proceedings of Computational Models in Language and Speech Workshop (CMLS 2018) co-located with the 15th TEL International Conference on Computational and Cognitive Linguistics (TEL-2018). Vol. 2303: Computational Models in Language and Speech 2018. Kazan: CEUR Workshop Proceedings, 2018. P. 128–143.

Added: February 3, 2022

On unifying the annotation of the corpus “Balanced Annotated Text Library”

Zaides K., in: Proceedings of the International Conference “Corpus Linguistics-2019”. Saint Petersburg State University Press, 2019. P. 332–339.

The paper is devoted to the process and results of unifying the annotation of the corpus “Balanced Annotated Text Library”. The corpus consists of several separate blocks representing the oral speech of members of different social and psychological groups. For further linguistic research, and in order to compare its data with those obtained from other corpora, the annotation system of the corpus had to be unified. The current stage involved replacing the main transcription symbols that mark special phenomena characteristic of ...

Added: February 3, 2022

On forming a set of relations for a corpus with discourse annotation of text

Sokolova E. G., Toldova S., Computational Linguistics and Computational Ontologies 2020 No. 4 P. 44–53

The work discusses the problem of discourse annotation and the consequences of simplifying the set of relations for the sake of higher inter-annotator agreement. One of the theoretical approaches to discourse structure representation is the Rhetorical Structure Theory by William Mann and Sandra Thompson [1]. There is a set of rhetorical relations between discourse units that ...

Added: November 17, 2021

Discourse features of blogs in subcorpus of Russian Ru-RSTreebank

Toldova S., Davydova T., Kobozeva M. et al., in: Computational Linguistics and Intellectual Technologies: Proceedings of the Annual International Conference “Dialogue” (Moscow, June 17–20, 2020). Issue 19(26): supplementary volume. 2020. P. 747–761.

The paper presents a corpus study of the discourse features in a corpus of blogs. It is based on the data of Ru-RSTreebank annotated within the framework of Rhetorical Structure Theory [Mann, Thompson 1988]. The Ru-RSTreebank represents the genres of news and popular science, scientific papers, and blog texts. The blog subcorpus contains such topics as ...

Added: November 17, 2021

Data clustering, keyword extraction, and lexical diversity in essay texts of a learner corpus

Scherbakova A., in: Intercultural Space: Linguistic and Didactic Aspects. Proceedings of the sections “Intercultural Linguistics”, “Intercultural Translatology” and the student research forum. Plenary session and the section “Intercultural Didactics”. Part 2. PetrSU Publishing House, 2021.

The paper focuses on the task of clustering essays produced by ESL (English as a Second Language) learners. The data were taken from the learner corpus REALEC. The division of texts by certain characteristics can be useful to speed up the analysis of a single corpus or access to the necessary sections of a large ...

Added: September 30, 2021

Automatic detection and correction of derivational errors in written Russian as a foreign language

Vyrenkova A. S., Smirnov I. Yu., Vestnik of Novosibirsk State University. Series: Linguistics and Intercultural Communication 2021 Vol. 19 No. 3 P. 57–68

Learner corpora serve as one of the most valuable sources of statistical data on learners' errors. For instance, data from foreign-language learners’ corpora can be used for Second Language Acquisition research. However, the representativeness of a corpus strongly depends on the quality of its error markup, which is most frequently carried out manually and thus presents a ...

Added: September 24, 2021

Cross-linguistic interference in the choice of tense-aspect forms of English verbs in essays by Russian-speaking students: a corpus study

Vinogradova O. I., Viklova A., in: Intercultural Space: Linguistic and Didactic Aspects. Part 2: Proceedings of the sections “Intercultural Linguistics”, “Intercultural Translatology” and the student research forum. Petrozavodsk: PetrSU Publishing House, 2021. P. 17–27.

Added: July 7, 2021