Corpus of Russian student texts: design and prospects

N. Zevakhina; S. Dzhakupova

The Corpus of Russian Student Texts (CoRST) is a computational and research project started in 2013 at the Linguistic Laboratory for Corpora Research Technologies at HSE. It comprises a collection of Russian texts written by students from various Russian universities. Its main research goal is to examine language deviations viewed as markers of language change. CoRST carries metalinguistic, morphological and error annotation, which makes it possible to build custom subcorpora and to search by error type. Its error annotation rests on a modular classification with three domains (lexis, grammar and discourse), within which the most frequent error phenomena are further distinguished. In total, the error classification encompasses 39 error tags (20 higher-level and 19 lower-level). The crucial characteristic of CoRST is that its error annotation is multi-layered: because an erroneous fragment can often be corrected in more than one way, it is typically annotated with several error tags accordingly. The corpus also supports search by two possible explanatory factors: typos and construction blending. The prospects for CoRST's development have both computational and research aspects, including qualitative and statistical comparative analysis of language phenomena in CoRST and NRC.

Language: English

Keywords: errors, corpus linguistics, corpus, learner corpus, error annotation

Publication based on the results of:

Corpus studies of language variation: from deviations to linguistic norm (2015)

In book

Proceedings of the 21st International Conference on Computational Linguistics "Dialogue"

Moscow: RGGU Publishing House, 2015.

Construction blending in the speech of non-standard Russian speakers: evidence from the Corpus of Russian Student Texts

Пужаева С. Ю., Zevakhina N., Dzhakupova S., in: Proceedings of the International Scientific Conference "Corpus Linguistics-2015". St. Petersburg: St. Petersburg State University Publishing House, 2015. P. 390–397.

The paper examines construction blending as an important cause of errors in students' written texts. The study is conducted within the framework of Construction Grammar [Fillmore and Kay 1992; Goldberg 1995, 2006] and the grammar of errors [Vyrenkova et al. 2014]. It is based on the data of the Corpus of Russian Student Texts supplied with ...

Added: May 20, 2015

Electronic corpora of the Albanian, Kalmyk, Lezgian and Ossetic languages

Arkhangelskiy T., Научно-техническая информация. Серия 2: Информационные процессы и системы 2012 No. 4 P. 24–29

Four electronic corpora created in 2011 within the framework of the “Corpus Linguistics: the Albanian, Kalmyk, Lezgian, and Ossetic Languages” Program of Fundamental Research of the RAS are presented. The interface and functionalities of these corpora are described, engineering problems to be solved in their creation are elucidated, and the promises of their development are ...

Added: October 31, 2012

Learner Corpora Researches Review (trends observed in the 8th conference CORPUS LINGUISTICS - 2015)

Vinogradova O. I., Journal of Language and Education 2015

The reviewed trends primarily involve the use of learner corpora in teaching and learning foreign languages; for many authors this implies an EFL context, though for learners with different L1s. The studies under review fall into four main types - use of learner corpora incorporated into teaching methodology, use of academic learner corpora in the ...

Added: October 12, 2015

How inter-annotator agreement helps to improve error annotation schemes in learner corpora

Fenogenova A., Kuzmenko E., Vinogradova O., in: TaLC 12 - Teaching and Language Corpora Conference. [s.n.], 2016. P. 30–34.

The scope and the level of change suggested by an annotator cannot be formally defined, and besides, it is not often that two persons - native speakers or fluent speakers of a foreign language – will not differ in their intuitive perception of what is acceptable in the language. However, if annotators stick to the ...

Added: December 11, 2016

TaLC 12 - Teaching and Language Corpora Conference

[s.n.], 2016.

Various issues relating to learner corpus research and its use in teaching are presented. These include the issue of a norm in corpora: whether the norm should necessarily be native, and what problems a native norm may present. Learners who behave differently from native speakers do not necessarily use language incorrectly as ...

Added: December 10, 2016

Omnia Russica: Even Larger Russian Corpus

Shavrina T., Benko V., in: Proceedings of the International Conference "Corpus Linguistics - 2019". St. Petersburg: St. Petersburg University Publishing House, 2019. Ch. 13 P. 94–102.

This paper focuses on combining Russian open corpus resources into one single source. The article describes the motivation for gradual integration of existing text resources to create a more general project and analyzes in detail the main steps to merge the existing data to formats based on NoSketch Engine corpus standards and interface. ...

Added: September 9, 2019

Annotating a learner corpus with a view to its use in research tasks

Klimova M., Viklova A., Overnikova D., in: Modern Linguistics: From Theory to Practice. III Kazan International Linguistic Summit (Kazan, November 14–19, 2022): Proceedings and Materials, in three volumes, vol. 1. Kazan: Kazan University Publishing House, 2022. P. 46–50.

This article examines the error classification used in the REALEC learner corpus in terms of how well it meets the requirements of research tasks and how well suited it is to them. ...

Added: January 17, 2023

Russian National Corpus 2.0: new opportunities and development prospects

Савчук С. О., Архангельский Т. А., Bonch-Osmolovskaya A. A. et al., Вопросы языкознания 2024 No. 2 P. 7–34

The paper provides an overview of the results of the fundamental reconstruction and modernization project of the National Corpus of the Russian Language platform, carried out from 2020 to 2023. The focus of the paper is on the new opportunities that are opening up for linguists and a wider audience. This includes improving the representativeness ...

Added: March 21, 2024

Corpus analysis of Russian verse

Moscow: Азбуковник, 2013.

This collection contains articles prepared using materials from the poetic corpus of the Russian National Corpus. Drawing on extensive material, the authors trace the history of individual words in the language of poetry, analyze various aspects of poetic grammar and semantics, and examine some formal parameters of Russian verse. The collection is intended for specialists in linguistic poetics and verse studies, as well as for those interested in modern ...

Added: September 28, 2013

Disyllabic comparative conjunctions in Russian poetry

Piperski A., in: Proceedings of the International Scientific Conference "Corpus Linguistics-2015". St. Petersburg: St. Petersburg State University Publishing House, 2015. P. 374–381.

The paper deals with the use of disyllabic comparative conjunctions budto, slovno and točno ‘like’ in the texts of fifteen Russian poets. I study the frequency of their use in cases where these conjunctions are mutually interchangeable and show that their total frequency increases after the end of the Golden Age of Russian poetry (approx.. ...

Added: March 15, 2017

Once again on the research potential of the poetic corpus: meter, lexis, formula

Orekhov B., Труды института русского языка им. В.В. Виноградова 2015 No. 6 P. 449–463

The article continues the trend of other researchers’ publications that demonstrate the opportunities of the poetic subcorpus of the Russian National corpus. The question is, what issues related to the history of Russian poetry can be solved with the help of the corpus. In the first part of the article there is a pilot study ...

Added: March 16, 2016

Grammatical profiles and formal differentiation of Russian biaspectual verbs

Piperski A., in: Twelfth Conference on Typology and Grammar for Young Scholars. Abstracts (St. Petersburg, November 19–21, 2015). St. Petersburg: Нестор-История, 2015. P. 69–72.

A study of the properties of Russian biaspectual verbs using corpus methods ...

Added: November 22, 2015

The Second Genitive in Russian

Daniel M., in: Partitive cases and related categories. Berlin, NY: De Gruyter Mouton, 2014. Ch. 9 P. 347–377.

This paper is an overview of the so-called second genitive in Russian, a nominal form available for a minority of Russian nouns but widely used with these nouns in certain contexts. In many ways, the second genitive is a secondary case. Thus, it may always be substituted with a regular genitive form, while the opposite ...

Added: October 17, 2013

Referential Choice: Predictability and Its Limits

Kibrik A. A., Khudyakova M., Dobrov G. B. et al., Frontiers in Psychology 2016 Vol. 7 No. 1429 P. 1–21

We report a study of referential choice in discourse production, understood as the choice between various types of referential devices, such as pronouns and full noun phrases. Our goal is to predict referential choice, and to explore to what extent such prediction is possible. Our approach to referential choice includes a cognitively informed theoretical component, ...

Added: September 28, 2016

The corpus as a tool and as an ideology: on some lessons of modern corpus linguistics

Plungian V., Русский язык в научном освещении 2008 No. 16 (2) P. 7–20

Added: November 12, 2023

Language Interference in Heritage Russian: Constructional Violations

Rakhilina E. V., Vyrenkova A. S. / NRU HSE. Series WP BRP "Linguistics". 2014. No. 11.

The problem of incomplete language acquisition and heritage languages is approached from several perspectives: who are heritage speakers, how are they different from native speakers and L2 learners, is heritage language a particular system? This paper aims at answering these and other questions focusing on constructional deviations in the output of heritage speakers and linguistic ...

Added: October 23, 2014

Maninka Reference Corpus: A Presentation

Vydrin V., Rovenchak A., Maslinsky K. A., in: Actes de la conférence conjointe JEP-TALN-RECITAL 2016. Vol. 11: Traitement automatique des langues africaines (TALAf). Paris: Association pour le Traitement Automatique des Langues, 2016. P. 87–94.

An annotated corpus of Guinean Maninka, Corpus Maninka de Référence (CMR), was published in April 2016. It includes two subcorpora: one contains texts originally written in Latin-based script (792,778 words), and the other is composed of texts in the N'ko alphabet (3,105,879 words). Both subcorpora are searchable in both Latin-based script and N'ko. In ...

Added: March 10, 2017

Публика ("The Public")

Skorinkin D., in: Two Centuries in Twenty Words. Moscow: HSE Publishing House, 2016. P. 294–316.

The article describes how the meanings of the word "публика" ("the public") developed and changed over the 19th and 20th centuries ...

Added: May 12, 2016

После, через and спустя in temporal contexts: observations on texts by Kazakh-Russian bilinguals

Rakhilina E. V., Казкенова А. К., Akhapkina Y., Вестник Томского государственного университета. Филология 2021 Vol. 73 P. 93–113

The paper examines non-standard uses of the prepositions после, через and спустя in temporal contexts by Kazakh-Russian bilinguals. It argues that the deviations stem from grammatical differences between the speakers' native language and Russian. The analysis of the deviations revealed specific features of the prepositions: their ability to indicate the completion of events and time spans, both single and recurring, as well as the ambiguity of через in combinations with the names of different time intervals. ...

Added: December 1, 2021

Use of learner corpus in General English and Academic English courses at the Higher School of Economics

Vinogradova O. I., in: Conference Proceedings. The Future of Education International Conference The Future of Education, 6th edition. Padova: libreriauniversitaria, 2016. P. 310–314.

There have been many reports on advances in the development of learner corpora that have made it possible to effectively use these collections of texts for the benefit of the learning process. This paper lists all possible applications in English courses taught to Bachelor students of a middle-size learner corpus REALEC, which comprises student written ...

Added: March 1, 2017

Discovering dialectal differences based on oral corpora

Andriyanets V., Daniel M., Pakendorf B., in: Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference "Dialogue" (Moscow, May 30 to June 2, 2018). Issue 17(24). Moscow: Russian State University for the Humanities Publishing Center, 2018. P. 28–38.

This paper discusses a method to detect statistically significant linguistic differences between corpora while factoring in possible variability within the very corpora to be compared. Specifically, we compare two small corpora of dialects of Even, Bystraja and Lamunkhin Even, in an attempt to identify morphemes that are more frequent in either of the corpora. To ...

Added: June 19, 2018

Computational analysis methods for determining the gender attribution of a text: a practical study

Khomenko A., in: The Cognitive-Discursive Paradigm in Linguistics and Related Sciences: Current Problems and Research Methodology: Proceedings of the X International Congress on Cognitive Linguistics. September 17–20, 2020. Vol. 2(41). Ural State Pedagogical University, 2020. P. 893–897.

This article discusses the application of an integrative approach to gender identification in forensic linguistics tasks. The author integrates methods from cognitive science, corpus linguistics and, more broadly, computational linguistics, together with classical structural text analysis, to identify the characteristics of male and female speech. ...

Added: August 11, 2021

On the ways and means of expressing fear in the Russian linguistic picture of the world

Botchkarev A., Вестник Новосибирского государственного университета. Серия: Лингвистика и межкультурная коммуникация 2016 Vol. 14 No. 3 P. 5–14

This article explores the ways of displaying fear in the Russian language image of the world. According to the National Corpus of the Russian language, in its most usual manifestation, fear covers and paralyzes; this distressing emotion is caused by somebody, apprehension to lose something or somebody as well as by exposure to an imminent ...

Added: November 28, 2016

Temperature terms in modern Eastern Armenian

Daniel M., Khurshudian V., in: Linguistics of Temperature. Amsterdam: John Benjamins Publishing Company, 2015. P. 392–439.

This paper is an analysis of lexical categorisation of the temperature domain in modern Eastern Armenian. Compared to the vast research outline proposed in (Koptjevskaja-Tamm 2011), this paper has several important limitations. First, it is focused on non-derived, primary temperature terms (most of which happen to be adjectives or nouns, or both). Derived lexical items, ...

Added: October 17, 2013