• A
  • A
  • A
  • АБВ
  • АБВ
  • АБВ
  • A
  • A
  • A
  • A
  • A
Обычная версия сайта
  • RU
  • EN
  • HSE University
  • Publications
  • Book chapter
  • Review of Practices of Collecting and Annotating Texts in the Learner Corpus REALEC
  • RU
  • EN
Расширенный поиск
Высшая школа экономики
Национальный исследовательский университет
Priority areas
  • business informatics
  • economics
  • engineering science
  • humanitarian
  • IT and mathematics
  • law
  • management
  • mathematics
  • sociology
  • state and public administration
by year
  • 2027
  • 2026
  • 2025
  • 2024
  • 2023
  • 2022
  • 2021
  • 2020
  • 2019
  • 2018
  • 2017
  • 2016
  • 2015
  • 2014
  • 2013
  • 2012
  • 2011
  • 2010
  • 2009
  • 2008
  • 2007
  • 2006
  • 2005
  • 2004
  • 2003
  • 2002
  • 2001
  • 2000
  • 1999
  • 1998
  • 1997
  • 1996
  • 1995
  • 1994
  • 1993
  • 1992
  • 1991
  • 1990
  • 1989
  • 1988
  • 1987
  • 1986
  • 1985
  • 1984
  • 1983
  • 1982
  • 1981
  • 1980
  • 1979
  • 1978
  • 1977
  • 1976
  • 1975
  • 1974
  • 1973
  • 1972
  • 1971
  • 1970
  • 1969
  • 1968
  • 1967
  • 1966
  • 1965
  • 1964
  • 1963
  • 1958
  • More
Subject
News
June 25, 2026
HSE Researchers Make Aldehydes Perform Dual Function
Chemists from HSE University have discovered a way to carry out a reductive addition reaction without using an external reducing agent. Instead, the required 'resource' is supplied by the aldehyde itself, one of the reaction participants. This approach helps prevent unwanted side reactions, reduces toxicity, and simplifies the production and synthesis of organic molecules, including those used in the manufacture of medicines. The study has been published in Journal of Catalysis.
June 25, 2026
HSE Scientists Explain Why Findings in Autism Research Differ
Researchers from the Cognitive Health and Intelligence Centre at HSE University conducted the first-ever systematic review of studies on the specifics of emotion-from-motion perception in autism. The review showed that differences found between autistic and non-autistic individuals are largely associated with the experimental design and the types of tasks given to study participants. The review findings have been published in Research in Autism.
June 22, 2026
‘In Science, You Are Your Own Boss
Polina Nasledskova is interested in identifying gaps in linguistics and topics that have been overlooked by other researchers. In an interview for the  Young Scientists of HSE University project, she spoke about rare ordinal numerals in Nakh-Daghestanian languages, the benefits of knitting for concentration, and the beauty of the Patriarshy Bridge.

 

Have you spotted a typo?
Highlight it, click Ctrl+Enter and send us a message. Thank you for your help!

Publications
  • Books
  • Articles
  • Chapters of books
  • Working papers
  • Report a publication
  • Research at HSE

?

Review of Practices of Collecting and Annotating Texts in the Learner Corpus REALEC

P. 77–88.
Vinogradova O. I., Lyashevskaya O.

REALEC, learner corpus released in the open access, had received 6,054 essays written in English by HSE undergraduate students in their English university-level examination by the year 2020. This paper reports on the data collection and manual annotation approaches for the texts of 2014–2019 and discusses the computer tools available for working with the corpus. This provides the basis for the ongoing development of automated annotation for the new portions of learner texts in the corpus. The observations in the first part were made on the reliability of the total of 134,608 error tags manually annotated across the texts in the corpus. Some examples are given in the paper to emphasize the role of the interference with learners’ L1 (Russian), one more direction of the future corpus research. A number of studies carried out by the research team working on the basis of the REALEC data are listed as examples of the research potential that the corpus has been providing

Language: English
Full text
DOI
Text on another site
Keywords: разметка корпусаcorpus annotationучебный корпусlearner corpusосвоение первого и второго языка (L1 и L2)learner academic writing in EnglishL1 Russianerror taxonomyучебное академическое письмо на английском языкетаксономия ошибок
Publication based on the results of:
Automated Detection of Writing Inaccuracies for Students of English in Russia (2021)

In book

Text, Speech, and Dialogue. 25th International Conference, TSD 2022, Brno, Czech Republic, September 6–9, 2022, Proceedings Lecture Notes in Computer Science (LNAI), vol. 13502
Vol. 13502. , Cham: Springer Publishing Company, 2022.
Similar publications
Correcting or Rewriting? An Expert Evaluation of LLM-Based GEC on Academic Learner Data
Копылова Е. В., Tsegoeva O. G., Берлин В. А. et al., , in: Компьютерная лингвистика и интеллектуальные технологии: По материалам ежегодной международной конференции «Диалог». Выпуск 24.Issue 24.: M.: Max press, 2026. P. 1–10.
This paper investigates how large language models correct complex grammatical errors in Russian academic learner writing. Unlike traditional minimal-edit GEC systems, LLMs often apply generative rewriting strategies that may improve fluency, but risk structural overcorrection and semantic drift. We introduce a new expert benchmark derived from an authentic 3,1M-word learner corpus and construct an evaluation set annotated for ...
Added: June 27, 2026
Syntactic complexity measures as linguistic correlates of proficiency level in learner Russian
Kisselev O., Klimov A., Mihail Kopotev, , in: Complexity, Accuracy and Fluency in Learner Corpus Research. Volume vi.: Amsterdam: John Benjamins Publishing Company, 2022. Ch. 3 P. 51–80.
The study reports on the results of a corpus-based evaluation of automatically extracted syntactic complexity measures as indices of Russian as a foreign language (FL) and Russian as a heritage language (HL) writing development. A list of 12 syntactic complexity measures was tested on a set of longitudinal, classroom-based data. The analyses demonstrated that the ...
Added: November 25, 2024
Distractor Generation for Lexical Questions Using Learner Corpus Data
Nikita Login, Jazykovedny Casopis 2023 Vol. 74 No. 1 P. 345–356
Learner corpora with error annotation can serve as a source of data for automated question generation (QG) for language testing. In case of multiple choice gapfill lexical questions, this process involves two steps. The first step is to extract sentences with lexical corrections from the learner corpus. The second step, which is the focus of ...
Added: September 16, 2024
Обработка слов с частотными орфографическими ошибками (исследование на базе учебного корпуса английского языка)
Klimova M., Viklova A., Overnikova D., Вестник Санкт-Петербургского университета. Язык и литература 2023 Т. 20 № 4 С. 824–837
The article presents an experimental study of the influence of the frequency of spelling errors in a word on its representation in mental lexicon. The hypothesis that frequently misspelled words cause difficulties in reading even if they are written correctly has been proved for native speakers of Russian and English. This paper aims to check ...
Added: January 26, 2024
Устный учебный корпус РКИ: новый источник данных для лингвистических и методических исследований
Vlasova E., Бец Ю. В., Северина Е. М., В кн.: «Русская грамматика в диалоге научных школ, направлений, методов».: Владивосток: Издательство ДВФУ, 2022.
В статье анализируются нетривиальные фонетические и грамматические явления устной речи иностранцев, изучающих русский язык. Показано, что устный учебный корпус позволяет получить систематическое представление о компенсаторных механизмах речепорождения, проверять и формулировать гипотезы. ...
Added: November 8, 2023
Аннотирование учебного корпуса в аспекте его использования для исследовательских задач
Klimova M., Viklova A., Overnikova D., В кн.: Современная лингвистика: от теории к практике. III Казанский международный лингвистический саммит (Казань, 14–19 ноября 2022 г.): Труды и материалы, в трёх томах, том 1.: Каз.: Издательство Казанского университета, 2022. С. 46–50.
В данной статье рассматривается классификация ошибок, используемая в учебном корпусе REALEC, в аспекте ее соответствия требованиям и приспособленности для исследовательских задач. ...
Added: January 17, 2023
Clausal complexity of expert and student writing: a corpus-based analysis of papers in social sciences
Smirnova E. A., Language Learning in Higher Education 2022 Vol. 12 No. 2 P. 453–475
Syntactic complexity has been extensively approached in the fields of corpus linguistics and academic discourse studies. However, works focusing on disciplinary variation in terms of linguistic complexity and comparison of professional and novice academic writing are scarce. Addressing these issues is likely to have important implications for EAP/ESP practitioners in terms of selection of target ...
Added: December 7, 2022
Рragmatic Markers in the Corpus “Оne Day of Speech”: Approaches to the Annotation
Zaides K., Popova T., Bogdanova-Beglarian Natalia, , in: Proceedings of Computational Models in Language and Speech Workshop (CMLS 2018) co-located with the 15th TEL International Conference on Computational and Cognitive Linguistics (TEL-2018)Vol. 2303: Computational Models in Language and Speech 2018.: Kazan: CEUR Workshop Proceedings, 2018. P. 128–143.
Added: February 3, 2022
Об унификации разметки корпуса «Сбалансированная аннотированная текстотека»
Zaides K., В кн.: Труды международной конференции «Корпусная лингвистика-2019».: Издательство Санкт-Петербургского государственного университета, 2019. С. 332–339.
Доклад посвящен процессу и результатам унификации разметки корпуса «Сбалансированная аннотированная текстотека». Данный корпус состоит из нескольких отдельных блоков, репрезентирующих устную речь представителей разных социальных и психологических групп. Для дальнейших лингвистических исследований, а также в целях сравнения данных, полученных на материале иных корпусов, необходимо было унифицировать систему разметки корпуса. На текущем этапе производилась замена основных знаков транскрипции, отмечающих особые явления, свойственные ...
Added: February 3, 2022
К вопросу о формировании набора отношений для корпуса с дискурсивной разметкой текста
Соколова Е. Г., Toldova S., Компьютерная лингвистика и вычислительные онтологии 2020 № 4 С. 44–53
The work discusses the problem of discourse annotation and the consequences of the relations set simplification for the sake of higher interannotator agreement. One of the theoretical approaches to discourse structure representation is the Rhetoric Structure Theory by William Mann and Sandra Thompson [1]. There is a set of rhetoric relations between discourse units that ...
Added: November 17, 2021
Discourse features of blogs in subcorpus of Russian Ru-RSTreebank
Toldova S., Davydova T., Kobozeva M. et al., , in: Компьютерная лингвистика и интеллектуальные технологии: по материалам ежегодной международной конференции «Диалог» (Москва, 17–20 июня 2020 г.)Issue 19(26): дополнительный том.: -, 2020. P. 747–761.
The paper presents a corpus study of the discourse features in the corpus of blogs. It is based on the data of Ru-RSTreebank annotated within the framework of the Rhetorical Structure theory [Mann, Thompson 1988]. The Ru-RSTreebank represents genres of news and popular science, scientific papers, and blogs texts. Blog subcorpus contains such topics as ...
Added: November 17, 2021
Кластеризация данных, извлечение ключевых слов и лексическое разнообразие в текстах эссе учебного корпуса
Scherbakova A., В кн.: Межкультурное пространство: лингвистический и дидактический аспекты. Материалы секций "Межкультурная лингвистика", "Межкультурная транслатология" и студенческого научного форума. Пленарное заседание и секция «Межкультурная дидактика».Ч. 2.: Издательство ПетрГУ, 2021.
The paper focuses on the task of clustering essays produced by ESL (English as a Second Language) learners. The data was taken from a learner corpus REALEC. The division of texts by certain characteristics can be useful to speed up the analysis of a single corpus or access to the necessary sections of a large ...
Added: September 30, 2021
Автоматическое обнаружение и исправление деривационных ошибок в письменной речи на русском как иностранном
Vyrenkova A. S., Смирнов И. Ю., Вестник Новосибирского государственного университета. Серия: Лингвистика и межкультурная коммуникация 2021 Т. 19 № 3 С. 57–68
Learner corpora serve as one of the most valuable sources of statistical data on learners' errors. For instance, data from foreign-language learners’ corpora can be used for the Second Language Acquisition research. However, corpora representativity strongly depends on the quality of its error markup, which is most frequently carried out manually and thus presents a ...
Added: September 24, 2021
Межъязыковая интерференция при выборе видо-временных форм английских глаголов в эссе русскоязычных студентов: корпусное исследование
Vinogradova O. I., Viklova A., В кн.: Межкультурное пространство: лингвистический и дидактический аспектыЧ. 2: Материалы секций «Межкультурная лингвистика», «Межкультурная транслатология» и студенческого научного форума.: Петрозаводск: Издательство ПетрГУ, 2021. С. 17–27.
Added: July 7, 2021
  • About
  • About
  • Key Figures & Facts
  • Sustainability at HSE University
  • Faculties & Departments
  • International Partnerships
  • Faculty & Staff
  • HSE Buildings
  • HSE University for Persons with Disabilities
  • Public Enquiries
  • Studies
  • Admissions
  • Programme Catalogue
  • Undergraduate
  • Graduate
  • Exchange Programmes
  • Summer University
  • Summer Schools
  • Semester in Moscow
  • Business Internship
  • Research
  • International Laboratories
  • Research Centres
  • Research Projects
  • Monitoring Studies
  • Conferences & Seminars
  • Academic Jobs
  • Yasin (April) International Academic Conference on Economic and Social Development
  • Media & Resources
  • Publications by staff
  • HSE Journals
  • Publishing House
  • iq.hse.ru: commentary by HSE experts
  • Library
  • Economic & Social Data Archive
  • Video
  • HSE Repository of Socio-Economic Information
  • HSE1993–2026
  • Contacts
  • Copyright
  • Privacy Policy
  • Site Map
Edit