Corpus of Russian student texts: design and prospects

N. Zevakhina; S. Dzhakupova

The Corpus of Russian Student Texts (CoRST) is a computational and research project started in 2013 at the Linguistic Laboratory for Corpora Research Technologies at HSE. It comprises a collection of Russian texts written by students from various Russian universities. Its main research goal is to examine language deviations viewed as markers of language change. CoRST is supplied with metalinguistic, morphological and error annotation, which makes it possible to build custom subcorpora and to search by error type. Its error annotation is based on a modular classification with three modules (lexis, grammar and discourse), within which the most frequent error phenomena are further distinguished. In total, the error classification encompasses 39 error tags (20 higher-level and 19 lower-level). A crucial characteristic of CoRST is that its error annotation is multi-layered: since an erroneous segment can typically be corrected in several ways, it is annotated with several error tags accordingly. Moreover, the corpus supports search by two possible explanatory factors: typos and construction blending. The prospects for CoRST development have both computational and research aspects, including qualitative and statistical comparative analysis of language phenomena in CoRST and NRC.

Language: English

Keywords: corpus; corpus linguistics; learner corpus; errors; error annotation

Publication based on the results of:

Corpus studies of language variation: from deviations to linguistic norm (2015)

In book

Proceedings of the 21st International Conference on Computational Linguistics "Dialogue"

Moscow: RSUH Publishing House, 2015.

Construction blending in the speech of non-standard Russian speakers: evidence from the Corpus of Russian Student Texts

Puzhaeva S. Yu., Zevakhina N., Dzhakupova S., in: Proceedings of the International Scientific Conference "Corpus Linguistics 2015".: St. Petersburg: St. Petersburg State University Press, 2015. P. 390–397.

The paper examines construction blending as an important cause of errors in students’ written texts. The study is conducted within the framework of Construction Grammar [Fillmore and Kay 1992; Goldberg 1995, 2006] and grammar of errors [Vyrenkova et al. 2014]. It is based on the data of the Corpus of Russian Student Texts supplied with ...

Added: May 20, 2015

Electronic corpora of the Albanian, Kalmyk, Lezgian, and Ossetic languages

Arkhangelskiy T., Научно-техническая информация. Серия 2: Информационные процессы и системы 2012 No. 4 P. 24–29

Four electronic corpora created in 2011 within the framework of the “Corpus Linguistics: the Albanian, Kalmyk, Lezgian, and Ossetic Languages” Program of Fundamental Research of the RAS are presented. The interface and functionalities of these corpora are described, engineering problems to be solved in their creation are elucidated, and the promises of their development are ...

Added: October 31, 2012

How inter-annotator agreement helps to improve error annotation schemes in learner corpora

Fenogenova A., Kuzmenko E., Vinogradova O., in: TaLC 12 - Teaching and Language Corpora Conference.: [s.n.], 2016. P. 30–34.

The scope and the level of change suggested by an annotator cannot be formally defined, and besides, it is not often that two persons – native speakers or fluent speakers of a foreign language – will not differ in their intuitive perception of what is acceptable in the language. However, if annotators stick to the ...

Added: December 11, 2016

TaLC 12 - Teaching and Language Corpora Conference

[s.n.], 2016.

Various issues relating to learner corpus research and its use in teaching are presented. These include the issue of a norm in corpora: whether the norm should necessarily be native, and what problems a native norm may present. Learners who behave differently from native speakers do not necessarily use language incorrectly as ...

Added: December 10, 2016

Omnia Russica: Even Larger Russian Corpus

Shavrina T., Benko V., in: Proceedings of the International Conference "Corpus Linguistics 2019".: St. Petersburg: St. Petersburg University Press, 2019. Ch. 13 P. 94–102.

This paper focuses on combining Russian open corpus resources into one single source. The article describes the motivation for gradual integration of existing text resources to create a more general project and analyzes in detail the main steps to merge the existing data to formats based on NoSketch Engine corpus standards and interface. ...

Added: September 9, 2019

Learner Corpora Researches Review (trends observed in the 8th conference CORPUS LINGUISTICS - 2015)

Vinogradova O. I., Journal of Language and Education 2015

The reviewed trends primarily involve the use of learner corpora in teaching and learning foreign languages; for many authors this implies an EFL context, but with learners of different L1s. The studies under review fall into four main types: use of learner corpora incorporated into teaching methodology, use of academic learner corpora in the ...

Added: October 12, 2015

Russian National Corpus 2.0: new opportunities and development prospects

Savchuk S. O., Arkhangelskiy T. A., Bonch-Osmolovskaya A. A. et al., Вопросы языкознания 2024 No. 2 P. 7–34

The paper provides an overview of the results of the fundamental reconstruction and modernization project of the National Corpus of the Russian Language platform, carried out from 2020 to 2023. The focus of the paper is on the new opportunities that are opening up for linguists and a wider audience. This includes improving the representativeness ...

Added: March 21, 2024

Annotating a learner corpus with a view to its use in research tasks

Klimova M., Viklova A., Overnikova D., in: Modern Linguistics: From Theory to Practice. III Kazan International Linguistic Summit (Kazan, November 14–19, 2022): Proceedings and materials, in three volumes, vol. 1.: Kazan: Kazan University Press, 2022. P. 46–50.

This paper examines the error classification used in the REALEC learner corpus in terms of how well it meets the requirements of research tasks and how suited it is to them. ...

Added: January 17, 2023

Digital archive of the literary magazine with pre-reform orthography «Otechestvennye Zapiski» (1839–1884)

Eugeniya Z., Klyshinskiy E., Voloshina E. et al., Computational Linguistics and Intellectual Technologies 2021 Suppl. vol. No. 20 P. 1239–1244

The paper describes an initial version of the digital archive of the literary magazine with the pre-reform orthography «Otechestvennye Zapiski». Today, the corpus contains 10 XML volumes of the literary magazine (~2 million words). The web application of the digital archive allows users to search for words and lemmas in the corpus and to edit the magazine’s texts ...

Added: June 6, 2022

The corpus in foreign language teaching (based on English)

Gorina O. G., St. Petersburg: Svoe Izdatelstvo, 2014.

This book illustrates the broad language-teaching potential of corpus linguistics for teaching profession-oriented communication in English. Extensive language material from a purpose-built corpus of professional discourse and from other corpus resources formed the basis of varied exercises, tasks, and studies used to develop lexical skills in the speaking and writing of students majoring in Regional Studies. Recommended for specialists – philologists, language teaching methodologists, ...

Added: February 20, 2017

Looking for contextual cues to differentiating modal meanings: A corpus-based study

Lyashevskaya O., Ovsjannikova M., Szymor N. et al., in: Quantitative approaches to the Russian language.: Abingdon: Routledge, 2018. P. 51–78.

The domain of modality is structurally diverse and may be described in multiple ways (for example, see Perkins, 1983; Wierzbicka, 1987; Hengeveld, 1988/2004; Sweetser, 1990; Bondarko, 1990; Bybee et al., 1994; van der Auwera and Plungian, 1998; Palmer, 2001; Hansen, 2004; Nuyts, 2006; Khrakovsky, 2007). The article reports on the Russian part of a larger survey ...

Added: October 24, 2017

The corpus as a tool and as an ideology: on some lessons of modern corpus linguistics

Plungian V., Русский язык в научном освещении 2008 No. 16 (2) P. 7–20

Added: November 12, 2023

Predicative-type pragmatic markers in the spontaneous spoken language of members of different social groups

Zaides K., Социо- и психолингвистические исследования 2020 No. 8 P. 40–47

The article examines the use of predicative-type pragmatic markers (знаешь/те 'you know', (я) не знаю 'I don't know', (я) (не) думаю (что) 'I (don't) think (that)', представь/те 'imagine', etc.) in the spontaneous spoken language of members of different social groups. The study draws on a working subcorpus made up of 150,000 tokens from the corpus of everyday Russian speech (in effect, dialogues) «Один речевой день» ("One Day of Speech") and 150,000 tokens from the corpus ...

Added: February 3, 2022

USING COMPUTATIONAL LINGUISTICS METHODS TO ANALYZE LITERARY TEXTS

Avanesyan N. L., Fokina A., Chepovskiy A., in: Enterprise Engineering and Knowledge Management (ИП&УЗ-2024): collected papers of the XXVII Russian Scientific Conference, November 28–29, 2024 / ed. by Yu. F. Telnov.: Moscow: Plekhanov Russian University of Economics, 2024. P. 15–18.

The article is devoted to applying mathematical methods of corpus analysis to the study of literary texts. Using purpose-built corpora, it demonstrates how correspondence analysis and the analysis of pairwise rank correlation coefficients can be used to compare the frequency characteristics of texts from different subcorpora. The methods described give correlated results. They can be used both for linguistic research and for building sound training text sets for artificial intelligence tasks. ...

Added: December 19, 2024

An overview of morphosyntactic variation in the speech of Russian-Chuvash bilinguals: number, gender, case assignment and preposition drop

Grishanova A., Russian linguistics 2025 Vol. 49 Article 10

The purpose of this study is to present a summary of morphosyntactic variation and a detailed analysis of the phenomenon of preposition drop in the Russian speech of Chuvash bilinguals. Specifically, I investigate what underlying factors might condition the variation. I conduct a qualitative analysis of the data extracted from the corpus of Russian spoken ...

Added: July 10, 2025

Russian predicates selecting remarkable clauses: corpus-based approach and Gricean perspective

Zevakhina N., Dainiak A., in: Bridging Formal and Conceptual Semantics: Selected Papers of the BRIDGE Workshop 14, Studies in Language and Cognition 4.: Dusseldorf University Press, 2017. P. 187–208.

This paper reports upon the study of the lexico-grammatical distribution of Russian matrix predicates selecting kakoj remarkable clauses (or so-called ‘embedded’ exclamatives) in the Russian National Corpus, with some cross-linguistic parallels. It reveals that Russian matrix predicates belong to four conceptual classes: perceptual, mental, emotive, and speech. It shows that the phenomenon of ‘embedded’ exclamatives ...

Added: March 8, 2016

Corpora as indicators of (non-)existence

Piperski A., in: Computational Linguistics and Intellectual Technologies. Papers from the Annual International Conference "Dialogue" (2015).: Moscow: RSUH Publishing House, 2015. P. 494–500.

This paper discusses the notions of acceptability, occurrence, grammaticality and existence, and focuses on the relationship between corpus linguistics and the question of the existence of lexical items. Since corpora are almost exclusively samples from larger populations, it is claimed that they cannot provide evidence for non-existence of words, collocations or constructions. This is because ...

Added: March 13, 2016

Corpus studies of the speech of non-standard speakers ("heritage Russian")

Rakhilina E. V., Marushkina A. S., Acta Linguistica Petropolitana. Труды института лингвистических исследований 2015 Vol. XI No. 1 P. 621–639

The paper presents an analysis of comparative, conditional and prepositional constructions in the speech of heritage speakers of Russian and learners of Russian as a second language, based on material from the Russian Learner Corpus. ...

Added: July 25, 2015

The cognitive term "frame": creating a dictionary entry based on a specialized text corpus

Khomenko A., Kulikova V. A., Babiy A. et al., Вестник Новосибирского государственного университета. Серия: Лингвистика и межкультурная коммуникация 2022 Vol. 20 No. 4 P. 17–34

The study is devoted to testing a specialized text corpus, using as an example a group of cognitive linguistics terms with the hypernym frame. The corpus includes a subcorpus of scientific texts and a subcorpus of journalistic texts. The first one is represented by 15 journals indexed in the RSCI; the second one ...

Added: November 17, 2022

Computational analysis methods for determining the gender attribution of a text: a practical study

Khomenko A., in: The Cognitive-Discursive Paradigm in Linguistics and Related Sciences: Current Problems and Research Methodology: proceedings of the X International Congress on Cognitive Linguistics, September 17–20, 2020. Vol. 2(41).: Ural State Pedagogical University, 2020. P. 893–897.

This article discusses an integrative approach to determining gender in the context of forensic linguistics tasks. The author integrates methods from cognitive science, corpus linguistics and, more broadly, computational linguistics, together with classical structural text analysis, to identify characteristics of male and female speech. ...

Added: August 11, 2021

Referential Choice: Predictability and Its Limits

Kibrik A. A., Khudyakova M., Dobrov G. B. et al., Frontiers in Psychology 2016 Vol. 7 No. 1429 P. 1–21

We report a study of referential choice in discourse production, understood as the choice between various types of referential devices, such as pronouns and full noun phrases. Our goal is to predict referential choice, and to explore to what extent such prediction is possible. Our approach to referential choice includes a cognitively informed theoretical component, ...

Added: September 28, 2016

Predictive validity of progressive-aspect verb forms in English corpus linguistics

Popkova E., Социосфера 2010 No. 4 P. 74–81

The article discusses the most recent trends in the development of the English progressive. A corpus-based approach to linguistic research is seen as an effective means of determining reliability of the data retrieved and helps track the major diachronic dynamic in the increasing frequency of the progressive aspect that has taken place since the beginning ...

Added: November 6, 2012

The Second Genitive in Russian

Daniel M., in: Partitive cases and related categories.: Berlin, NY: De Gruyter Mouton, 2014. Ch. 9 P. 347–377.

This paper is an overview of the so-called second genitive in Russian, a nominal form available for a minority of Russian nouns but widely used with these nouns in certain contexts. In many ways, the second genitive is a secondary case. Thus, it may always be substituted with a regular genitive form, while the opposite ...

Added: October 17, 2013

USE OF LEARNER CORPUS IN GENERAL ENGLISH AND ACADEMIC ENGLISH COURSES AT THE HIGHER SCHOOL OF ECONOMICS

Vinogradova O. I., in: Conference Proceedings. The Future of Education International Conference The Future of Education, 6th edition.: Padova: libreriauniversitaria, 2016. P. 310–314.

There have been many reports on advances in the development of learner corpora that have made it possible to effectively use these collections of texts for the benefit of the learning process. This paper lists all possible applications in English courses taught to Bachelor students of a middle-size learner corpus REALEC, which comprises student written ...

Added: March 1, 2017