Corpus Tools in Grammatical Studies of the Russian Language (Корпусные инструменты в грамматических исследованиях русского языка)

О. Н. Ляшевская

M.: Языки славянской культуры, 2016.

Corpus linguistics can be broadly defined in terms of two partially overlapping research dimensions. On the one hand, corpus linguistics is knowledge of how to compile and annotate linguistic corpora. On the other hand, corpus linguistics is a family of qualitative and quantitative methods of language study based on corpus data. The book presents the first steps taken by Russian corpus linguistics toward the development of language corpora and corpus-based resources as well as their use in grammatical and lexical analysis.

The first part of the book focuses on the annotation of Russian texts at several levels: lemmas, part of speech and inflectional forms, word formation, lexical-semantic classes, syntactic dependencies, semantic roles, frames, and lexical constructions. We discuss various theoretical principles and practical considerations motivating the corpus markup design, provide details on the creation of lexical resources (electronic dictionaries and databases) and text processing software, and consider complicated cases that present challenges for the annotation of corpora both manually and automatically. In most cases we describe the annotation of the Russian National Corpus (RNC, ruscorpora.ru) and its affiliate project FrameBank (framebank.ru).

Frequency data depend not only on the representativeness and balance of texts in a corpus, but also on the rules and tools used for annotation. The book addresses the development of evaluation standards for Russian NLP resources, namely, morphological taggers and dependency parsers. In addition, the book presents several experiments on automatic annotation and disambiguation: lemmatization of word forms not in the dictionary; word sense disambiguation based on vectors formed by lexical, semantic and grammatical cues of context; and semantic role labeling.

The final chapters of the first part of the book outline two types of frequency dictionaries based on the RNC data: a general-purpose frequency dictionary and a lexico-grammatical one.

The second part of the book presents an analysis of corpus data and includes a number of case studies of Russian grammar and lexical-grammatical interaction using quantitative methods. The key concept underlying our analysis is the behavioral profile (Hanks 1996; Divjak, Gries 2006), which is the frequency distribution of variable elements in a linguistic unit as attested in a corpus. This covers grammatical profiles (the frequency distribution of inflected forms of a word), constructional profiles (the frequency distribution of argument or any other constructions attested for a key predicate), lexical and semantic profiles (the frequency distribution of words and lexical-semantic classes in construction slots or, more generally, in the context of a word), and radial category profiles (the frequency distribution of word senses and word uses across the radial category network of a polysemous unit). We use grammatical, constructional, semantic, and radial category profiling to study tense, aspect and mood specialization of Russian verb forms; to identify singular-oriented and plural-oriented nouns; to investigate factors for prefix choice and prefix variation in natural perfectives (chistovidovye perfectivy); and to analyze constraints on the filling of slots in a construction and how they affect the meaning of the construction, taking as examples the Genitive construction of shape and the spatial construction with the preposition poverkh ‘up and over’.
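As a rough illustration of what a grammatical profile is, the following minimal Python sketch computes the relative frequency of each inflected form attested for a lemma. The function names, the tag labels and the toy sample are assumptions made for this illustration; they are not taken from the book or from the RNC tagset.

```python
from collections import Counter


def grammatical_profile(tagged_tokens, lemma):
    """Relative frequency of each grammatical tag attested for `lemma`."""
    counts = Counter(tag for lem, tag in tagged_tokens if lem == lemma)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: n / total for tag, n in counts.items()}


def plural_share(profile):
    """Share of plural forms in a profile (tags assumed to start with 'pl')."""
    return sum(freq for tag, freq in profile.items() if tag.startswith("pl"))


# Toy, hand-made (lemma, tag) pairs; hypothetical data, not corpus counts.
sample = [
    ("глаз", "pl.nom"), ("глаз", "pl.acc"), ("глаз", "pl.nom"), ("глаз", "sg.nom"),
    ("дом", "sg.nom"), ("дом", "sg.loc"), ("дом", "pl.nom"),
]

profile = grammatical_profile(sample, "глаз")
share = plural_share(profile)
```

On this toy sample the plural share of глаз 'eye' is higher than that of дом 'house', which is the kind of contrast one would look at when separating plural-oriented from singular-oriented nouns; real conclusions would of course need corpus-scale counts.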

The quantitative corpus-based techniques used for the analysis range from simple descriptive statistics (e.g., absolute frequencies, percentages, measures of central tendency and outliers) to Fisher's exact test and logistic regression. We claim that vector modeling approaches to quantitative grammatical studies are no less effective in theoretical linguistics than in computational linguistics, where they have become a standard tool.
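To make one of the techniques mentioned above concrete, here is a minimal sketch of a two-sided Fisher's exact test on a 2×2 frequency table, written with the Python standard library only. The function name and the counts in the usage line are hypothetical; the book does not describe its own implementation, and in practice a statistics package would normally be used.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum over all tables that are at least as unlikely as the observed one;
    # the small tolerance guards against floating-point ties.
    p = sum(q for q in map(prob, range(lo, hi + 1)) if q <= p_obs * (1 + 1e-9))
    return min(1.0, p)


# Hypothetical counts: rows are two nouns, columns are plural vs. singular uses.
p_value = fisher_exact_two_sided(8, 2, 1, 5)  # about 0.035
```

A small p-value here would suggest that the two nouns differ in their preference for plural forms; with counts this small the exact test is preferable to a chi-squared approximation.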

Research target: Philology and Linguistics

Priority areas: humanities

Language: Russian

Keywords: Russian language; Russian National Corpus; corpus linguistics; corpus annotation; quantitative analysis; quantitative linguistics

Publication based on the results of:

A quantitative corpus study of the grammatical category of number (2014)

A Data Analysis Tool for the Corpus of Russian Poetry

Lyashevskaya O., Vlasova E., Litvintseva K. et al. / NRU HSE. Series WP BRP "Linguistics". 2018. No. 77.

A data analysis tool for the Corpus of Russian Poetry (a part of the Russian National Corpus) is designed for quantitative research in various areas of versology and on linguistic aspects of poetic texts. The core part, a statistical database of the corpus, includes annotation at the level of texts, verses, words as well as patterns ...

Added: December 13, 2018

Materials for a Corpus Grammar of Russian

St. Petersburg: Нестор-История, 2018.

The volume is the third issue of a corpus-based grammar of Russian. It deals with parts of speech and, more generally, with formal classes of the lexicon. It comprises descriptive papers on individual parts of speech and minor word classes. ...

Added: November 4, 2018

Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference “Dialogue” (2019)

M.: Russian State University for the Humanities, 2019.

The book includes 64 papers submitted to Dialogue 2019, the International Conference on Computational Linguistics and Intellectual Technologies, and presents a broad spectrum of theoretical and applied research on natural language description, language simulation, and the creation of applied computer technologies. ...

Added: October 16, 2019

Quantitative methods in diachronic corpus studies: constructions with predicatives and dative subjects

Bonch-Osmolovskaya A. A., Computational Linguistics and Intellectual Technologies 2015 Vol. 1 No. 14(21) P. 80–95

The paper proposes new approaches to the problem of Russian dative subjects in predicative and adjective constructions. The core idea of the research is to study the distribution of dative subject constructions with predicative and adjective forms that potentially can be used in such constructions. The methodological novelty of the approach is manifested in the ...

Added: April 15, 2015

Using the Russian National Corpus in foreign language teaching

Prilepskaya M. V., Альманах современной науки и образования 2010 No. 11(42) Pt. 1 P. 111–113

In contemporary Russia, studying foreign languages has become a mandatory requirement for obtaining a higher education degree, not only at specialized language and humanities institutions but also at highly ranked non-language institutions of higher education. Proficiency in a foreign language, predominantly English, is a necessary component of the competence of graduates in any field of their future professional activity. Meanwhile, knowledge of the native language, Russian, is given importance only ...

Added: November 7, 2012

Addressing people by name in Russian: A corpus study

Piperski A., Grabovskaya M., Gridneva E. et al. / NRU HSE. Series WP BRP "Linguistics". 2019. No. 92.

In Russian, there are many ways to address a person by name. For instance, a man called Aleksandr may be addressed as Aleksandr, Aleksandr Ivanovič, Saša, Sašen′ka, Saška, Sanja, etc. This study aims at analyzing the use of various strategies of naming the listener throughout the last two centuries. It uses the data from the ...

Added: December 15, 2019

Russian National Corpus 2.0: new opportunities and development prospects

Савчук С. О., Архангельский Т. А., Bonch-Osmolovskaya A. A. et al., Вопросы языкознания 2024 No. 2 P. 7–34

The paper provides an overview of the results of the fundamental reconstruction and modernization project of the National Corpus of the Russian Language platform, carried out from 2020 to 2023. The focus of the paper is on the new opportunities that are opening up for linguists and a wider audience. This includes improving the representativeness ...

Added: March 21, 2024

Social media in English and Russian linguistic consciousness. Article 2. Corpus linguistics and a modeling experiment

Шляхова С. С., Klyuev N., Psiholingvistika 2020 Vol. 28 No. 2 P. 204–223

The article consists of two parts. The first part is devoted to the investigation of the structure and content of the concept “social media” in Russian and English linguistic consciousness according to the data of a serial psycholinguistic experiment. The second part provides an analysis of the concept in corpus linguistics. It also includes the ...

Added: May 25, 2021

Automatic part-of-speech tagging for Russian using transformation-based learning

Kitov V. V., Научные труды Вольного экономического общества России 2014 Vol. 186 P. 228–235

This paper describes the application of the well-known transformation-based learning algorithm, which generates rules automatically, to the task of part-of-speech tagging. The algorithm is applied to corpora of annotated Russian texts, and its accuracy as well as the most significant rules are reported. ...

Added: March 16, 2016

A lexical minimum for the language of a specialty: how many words are enough? Developing principles of minimization

Olshevskaya M., Карпова Е. Л., Vlasova E., Вестник Новосибирского государственного университета. Серия: Лингвистика и межкультурная коммуникация 2019 Vol. 17 No. 4 P. 63–77

This article analyses the methodology of compiling Russian general wordlists and lexical minima for teaching Russian for specific purposes. The study systematizes three approaches: linguo-didactic, linguo-statistical, and corpus-based. The article also describes the process and results of applying all three methods to the development of a lexical minimum based on a political science corpus. The methodological analysis comprises ...

Added: October 2, 2019

The Russian National Corpus as a basis for innovative electronic textbooks

Sibirtseva V., Khomenko A., Baranova J., Образовательные технологии и общество 2013 Vol. 16 No. 3 P. 508–521

The article reports on the work of a student and teacher research group at the National Research University Higher School of Economics called "Corplingui (Nizhny Novgorod-Moscow)". The group conducts research in computational and corpus linguistics. Its development work focuses primarily on the creation of interactive resources based on the materials of the Russian National Corpus. The ...

Added: October 4, 2013

An Exploratory Study on Sociolinguistic Variation of Russian Everyday Speech

Bogdanova-Beglarian N., Sherstinova T., Blinova O. et al., Lecture Notes in Computer Science 2016 Vol. 9811 P. 100–107

The research presented in this paper has been conducted in the framework of the large sociolinguistic project aimed at describing everyday spoken Russian and analyzing the special characteristics of its usage by different social groups of speakers. The research is based on the material of the ORD corpus containing long-term audio recordings of everyday communication. ...

Added: December 31, 2017

Коньячку бы, да до дому ('some cognac would be nice, and then off home'): a chronology of the development of some forms of the second genitive case

Budennaya E., Труды института русского языка им. В.В. Виноградова 2024 No. 2(40) P. 261–282

The article, based on material from the Russian National Corpus, discusses the diachronic development of structures with the Russian second genitive case in three types of contexts: 1) with nominal quantifiers; 2) with the preposition bez ‘without’; 3) with the preposition do ‘towards’. The data obtained from Russian are compared with data from other languages (Finnic and several Turkic), in which there is a tendency to use the partitive ...

Added: October 4, 2024

Looking for contextual cues to differentiating modal meanings: A corpus-based study

Lyashevskaya O., Ovsjannikova M., Szymor N. et al., in: Quantitative approaches to the Russian language. Abingdon: Routledge, 2018. P. 51–78.

The domain of modality is structurally diverse and may be described in multiple ways (for example, see Perkins, 1983; Wierzbicka, 1987; Hengeveld, 1988/2004; Sweetser, 1990; Bondarko, 1990; Bybee et al., 1994; van der Auwera and Plungian, 1998; Palmer, 2001; Hansen, 2004; Nuyts, 2006; Khrakovsky, 2007). The article reports on the Russian part of a larger survey ...

Added: October 24, 2017

Modeling everyday speech behavior: A corpus of young people's spoken language, or ORD V. 2.0

Sherstinova T., Петрова И. А., Социо- и психолингвистические исследования 2023

To effectively model contemporary speech processes within daily communication, comprehensive linguistic resources, such as the ORD corpus, are indispensable. This paper introduces a new resource developed using continuous audio recording of informants' verbal behavior: a corpus of young people's spoken language named ESC (Everyday Student Conversations). The primary objective behind this corpus' ...

Added: December 10, 2023

The intensifier "до ужаса" in Russian on the path to grammaticalization

Герасимов Д. В., Acta Linguistica Petropolitana. Труды института лингвистических исследований 2016 Vol. XII No. 1 P. 336–363

The paper presents a corpus-driven study of the Russian PP-based degree modifier do uzhasa (lit. ‘to horror’), suggesting a two-stage grammaticalization path. The first stage (presumably, XVIII–XIX c.) involves subjectification, while during the second stage, subjective readings give rise to intensifier readings through conceptual metonymy. Both stages see a host class expansion. This process is ...

Added: November 27, 2017

Russian challenges for quantitative research

Kopotev M., Lyashevskaya O., Mustajoki A., in: Quantitative approaches to the Russian language. Abingdon: Routledge, 2018. P. 3–29.

The Russian language, despite being one of the most studied in the world, until recently has been little explored quantitatively. After a burst of research activity in the years 1960–1980, quantitative studies of Russian vanished. They are now reappearing in an entirely different context. Today, we have large and deeply annotated corpora available for extended ...

Added: October 24, 2017

Frequency dictionary of inflectional paradigms: core Russian vocabulary

Lyashevskaya O. / Basic Research Programme. Series HUM "Humanities". 2013.

A new kind of frequency dictionary is a valuable reference for researchers and learners of Russian. It shows the grammatical profiles of nouns, adjectives and verbs, namely, the distribution of grammatical forms in the inflectional paradigm. The dictionary is based on data from the Russian National Corpus (RNC) and covers a core vocabulary (5000 most ...

Added: May 13, 2013

Dialect loss in the Russian North: modeling change across variables

Daniel M., von Waldenfels R., Ter-Avanesova A. et al., Language Variation and Change 2019 Vol. 31 No. 3 P. 353–376

We analyze the dynamics of dialect loss in a cluster of villages in rural northern Russia based on a corpus of transcribed interviews, the Ustja River Basin Corpus. Eleven phonological and morphological variables are analyzed across 33 speakers born between 1922 and 1996 in a series of logistic regression models. We propose three characteristics for ...

Added: September 22, 2019

Concord in Russian close appositional constructions: a quantitative study

Logvinova N., Russian linguistics 2024 Vol. 48 No. 1 Article 4

The paper discusses case concord in Russian appositional constructions, which manifests itself in optional case concord of the proper name (v rek-e[LOC] Don-e[LOC] / v rek-e[LOC] Don[NOM] ‘in the river Don’). The study provides an in-depth corpus analysis of more than 15,000 examples, using a logistic regression statistical model to predict the choice between presence and ...

Added: March 17, 2024

Predictive validity of progressive aspect verb forms in English corpus linguistics

Popkova E., Социосфера 2010 No. 4 P. 74–81

The article discusses the most recent trends in the development of the English progressive. A corpus-based approach to linguistic research is seen as an effective means of determining reliability of the data retrieved and helps track the major diachronic dynamic in the increasing frequency of the progressive aspect that has taken place since the beginning ...

Added: November 6, 2012

Predicate agreement with the words ряд 'row, series', половина 'half', часть 'part', множество 'multitude' in modern Russian

Kuvshinskaya Y. M., Сибирский филологический журнал 2019 No. 2 P. 189–215

The work deals with strategies for predicate agreement with quantified noun groups headed by nouns. In Russian, as in other Slavic languages, predicate agreement with quantified noun phrases allows singular or plural forms of the predicate. For sentences with the quantifier nouns r’ad, polovina, chast’, mnozestvo, three agreement strategies are possible: predicate agrees with ...

Added: September 8, 2019

Adverbial phrases in Hasidic Yiddish

Arkhangelskiy T., Panova T., International Journal of the Sociology of Language 2014

The purpose of our study is to investigate the lexicalization of so-called adverbial phrases, such as fun a mol, in modern Hasidic Yiddish in comparison with written literary Yiddish of the 20th century. The phenomenon in question is a historical process in which several lexemes forming a frequent collocation (including nouns, adjectives, adverbs, prepositions and ...

Added: December 11, 2014

Using TXM Platform for Research on Language Changes over Time: The Dynamics of Vocabulary and Punctuation in Russian Literary Texts

Lavrentiev A. M., Sherstinova T., Chepovskiy A. et al., Vestnik Tomskogo Gosudarstvennogo Universiteta, Filologiya 2021 Vol. 70 P. 69–89

The purpose of this paper is to test the methodological tools provided by the TXM platform for research on the dynamics of vocabulary and punctuation marks in diachronic corpora. TXM is powerful text analysis software which provides both quantitative and qualitative features in a transparent open-source implementation. In this paper, we demonstrate how it can be ...

Added: June 24, 2021