An accentological corpus as a tool for studying Russian stress
The goal of the study is to show links between lexical and social diachronic change. The study is conducted within the culturomics framework (Michel et al. 2011). In contrast to the Big Data approach, the study promotes the idea of medium data, i.e. an amount of data that allows both quantitative and qualitative analysis. The research is based on data from the Russian National Corpus (ruscorpora.ru). The study traces changes in the context frequencies of the lexeme road in the period from 1800 to 2000, and correlates the observations with social and economic progress as well as with changes in the conceptual space of the language.
The collection includes scientific, literary and journalistic materials and interviews with culturologists and literary critics from different countries (Germany, Georgia, Poland, Russia, USA, etc.), as well as with observers of the post-Soviet transformations of Russian-Georgian relations. The main task of the research is to highlight the current topics and perspectives for modeling the new reality of this literary and cultural field within the framework of an interdisciplinary and international dialogue. It was literature that provided useful material that allowed us to look behind the scenes of geopolitical narratives, since politically significant categories of thought are inextricably linked both with literary images of Russia and Georgia and with the Russian-Georgian myth. Whereas in the Soviet era this myth contributed to an almost ritual study of the history of relations between the two peoples, since the second half of the 1980s it has become a kind of foundation for the political, military and, as it seemed, cultural division of the once "fraternal republics".
Corpus linguistics can be broadly defined in terms of two partially overlapping research dimensions. On the one hand, corpus linguistics is knowledge of how to compile and annotate linguistic corpora. On the other hand, corpus linguistics is a family of qualitative and quantitative methods of language study based on corpus data. The book presents the first steps taken by Russian corpus linguistics toward the development of language corpora and corpus-based resources as well as their use in grammatical and lexical analysis.
The first part of the book focuses on the annotation of Russian texts at several levels: lemmas, part of speech and inflectional forms, word formation, lexical-semantic classes, syntactic dependencies, semantic roles, frames, and lexical constructions. We discuss various theoretical principles and practical considerations motivating the corpus markup design, provide details on the creation of lexical resources (electronic dictionaries and databases) and text processing software, and consider complicated cases that present challenges for the annotation of corpora both manually and automatically. In most cases we describe the annotation of the Russian National Corpus (RNC, ruscorpora.ru) and its affiliate project FrameBank (framebank.ru).
Frequency data depend not only on the representativeness and balance of texts in a corpus, but also on the rules and tools used for annotation. The book addresses the development of evaluation standards for Russian NLP resources, namely, morphological taggers and dependency parsers. In addition, the book presents several experiments on automatic annotation and disambiguation: lemmatization of word forms not in the dictionary; word sense disambiguation based on vectors formed by lexical, semantic and grammatical cues of context; and semantic role labeling.
The final chapters of the first part of the book outline two types of frequency dictionaries based on the RNC data: a general-purpose frequency dictionary and a lexico-grammatical one.
The second part of the book presents an analysis of corpus data and includes a number of case studies of Russian grammar and lexical-grammatical interaction using quantitative methods. The key concept underlying our analysis is the behavioral profile (Hanks 1996; Divjak, Gries 2006), which is the frequency distribution of variable elements in a linguistic unit as attested in a corpus. This covers grammatical profiles (the frequency distribution of inflected forms of a word), constructional profiles (the frequency distribution of argument or any other constructions attested for a key predicate), lexical and semantic profiles (the frequency distribution of words and lexical-semantic classes in construction slots or, more generally, in the context of a word), and radial category profiles (the frequency distribution of word senses and word uses across the radial category network of a polysemous unit). We use grammatical, constructional, semantic, and radial category profiling to study tense, aspect and mood specialization of Russian verb forms; to identify singular-oriented and plural-oriented nouns; to investigate factors for prefix choice and prefix variation in natural perfectives (chistovidovye perfectivy); to analyze constraints on the filling of slots in a construction and how this affects the meaning of the construction, taking as an example the Genitive construction of shape and the spatial construction with the preposition poverkh ‘up and over’.
The quantitative corpus-based techniques used for the analysis vary from simple descriptive statistics (e.g., absolute frequencies, percentages, measures of central tendency and outliers) to Fisher's exact test and logistic regression. We claim that vector modeling approaches to quantitative grammatical studies are no less effective in theoretical linguistics than in computational linguistics, where they have become a standard tool.
Philological research, especially in the field of literature, is usually considered a "thing-in-itself"; the intrinsic value of this phenomenon involves extremely intuitive, creative, "human-readable" analysis. Meanwhile, the modern variety of computer programs (semantic text summarizers, tag clouds, concordancers, etc.), created for the humanities as well as for fields such as sociology, psychology and management, cannot but draw a philologist's attention. The steps for working with a parallel subcorpus in the Russian National Corpus are described in detail. The freeware LF Aligner (free for non-commercial use) is reviewed and used to compare Russian translations of the novel "All Red" by J. Chmielewska. The lexical items selected as examples are the modal word "avos'", the word "nakonets" in its parenthetical and adverbial uses, and the words "ves'" and "tsely". LF Aligner was used to process three translations of the novel, by M. Krongauz, V. Selivanova and O. Kuznetsova. A consistent description of the existing programs, their testing on literary material and a comparison of the resulting data with existing traditional research, especially in the fields of philology and foreign language teaching, constitute a new step in text analysis.
The paper describes the experience of using authentic linguistic corpus materials within the project "Creating an electronic textbook of Russian as a foreign language". Special attention is paid to the fundamental principle of the new project: automatic adaptation of the RNC's linguistic material. Developed by means of information technologies, the product is intended to adapt the complexity of authentic texts in terms of their syntactic and morphological structures and vocabulary. The stages required to attain this objective are also explained in the article. The paper describes not only the algorithm for solving the tasks and the final result of the research, but also the difficulties the developers face and their solutions.
A new electronic frequency dictionary shows the distribution of grammatical forms in the inflectional paradigm of Russian nouns, adjectives and verbs, i.e. the grammatical profile of individual lexemes and lexical groups. While the frequency hierarchy of grammatical categories (e.g. the frequency of part-of-speech classes or the average ratio of Nominative to Instrumental case forms) has long been a standard topic of research, the present project shifts the focus to the distribution of grammatical forms in particular lexical units. Of particular concern are words with certain biases in their grammatical profile, e.g. verbs used mostly in the Imperative or in the past neuter, or nouns used often in the plural. The dictionary will be a source for much future research in the area of Russian grammar, paradigm structure, grammatical semantics, and the variation of grammatical forms.
The resource is based on the data of the Russian National Corpus. The article addresses some general issues such as corpora use in compiling frequency resources and technology of corpus data processing. We suggest certain solutions related to the selection of data and the level of granularity of grammatical profile. Text creation time and language registers are discussed as parameters which may shape the grammatical profile fluctuations.
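The notion of a grammatical profile described above can be illustrated with a minimal sketch. It assumes the corpus has already been reduced to (lemma, grammatical tag) pairs; the function names, the tag labels and the 0.5 threshold are hypothetical choices made for illustration and are not taken from the dictionary itself.

```python
from collections import Counter, defaultdict


def grammatical_profiles(tokens):
    """Return {lemma: {tag: relative frequency}} for (lemma, tag) pairs."""
    counts = defaultdict(Counter)
    for lemma, tag in tokens:
        counts[lemma][tag] += 1
    profiles = {}
    for lemma, tag_counts in counts.items():
        total = sum(tag_counts.values())
        profiles[lemma] = {tag: n / total for tag, n in tag_counts.items()}
    return profiles


def biased_lemmas(profiles, tag, threshold=0.5):
    """Lemmas whose share of `tag` reaches the (assumed) threshold."""
    return sorted(
        lemma for lemma, profile in profiles.items()
        if profile.get(tag, 0.0) >= threshold
    )


# Toy, invented data: not real corpus counts.
sample = [
    ("noga", "sg"), ("noga", "pl"), ("noga", "pl"), ("noga", "pl"),
    ("dom", "sg"), ("dom", "sg"), ("dom", "pl"),
]
profiles = grammatical_profiles(sample)
plural_oriented = biased_lemmas(profiles, "pl")
```

In a real setting the tags would come from the corpus annotation, and the granularity of the tag set (number only, or number plus case, and so on) would shape the resulting profile.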
Our research aims at the automatic identification of constructions associated with particular lexical items and its subsequent use in building a catalogue of Russian lexical constructions. The study is based on data extracted from the Russian National Corpus (RNC, http://ruscorpora.ru). The main emphasis is on extensive use of morphological and lexico-semantic data drawn from the multi-level corpus annotation. Lexical constructions are regarded as the most frequent combinations of a target word and corpus tags which regularly occur within a certain left and/or right context and mark a given meaning of a target word. We focus on nominal constructions with target lexemes that refer to speech acts, emotions, and instruments. The toolkit that processes corpus samples and learns the constructions is described. We provide an analysis of the structure and content of the extracted constructions (e.g. r:ord der:num t:ord r:qual|pervyj ‘first’ + LJUBOV’ ‘love’; LJUBOV’ ‘love’ + PR|s ‘from’ + ANUM m sg gen|pervyj ‘first’ + S f inan sg gen|vzgljad ‘sight’ = love at first sight). As regards their structure, constructions may be considered as n-grams (n is 2 to 5). The representation of constructions is bipartite, as they may combine either morphological and lemma tags or lexical-semantic and lemma tags. We discuss the use of the visualization module PATTERN.GRAPH, which represents the inner structure of the extracted constructions.
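The idea of treating constructions as n-grams (n from 2 to 5) of a target word plus corpus tags can be sketched as follows. This is only an illustration under simplifying assumptions, not the toolkit described in the abstract: sentences are assumed to be lists of (lemma, tag) pairs, the function name `extract_patterns` is hypothetical, and the tags in the toy data are simplified.

```python
from collections import Counter


def extract_patterns(sentences, target, n_min=2, n_max=5):
    """Count n-grams that contain `target`: the target is kept as an
    upper-cased lemma, every other position is replaced by its tag."""
    counts = Counter()
    for sent in sentences:
        for i, (lemma, _tag) in enumerate(sent):
            if lemma != target:
                continue
            for n in range(n_min, n_max + 1):
                # Every window of length n that covers position i.
                first = max(0, i - n + 1)
                last = min(i, len(sent) - n)
                for start in range(first, last + 1):
                    window = sent[start:start + n]
                    pattern = tuple(
                        lem.upper() if start + k == i else tag
                        for k, (lem, tag) in enumerate(window)
                    )
                    counts[pattern] += 1
    return counts


# Toy, invented input with simplified tags.
sentences = [
    [("pervyj", "ANUM"), ("ljubov'", "S")],
    [("ljubov'", "S"), ("s", "PR"), ("pervyj", "ANUM"), ("vzgljad", "S")],
]
counts = extract_patterns(sentences, "ljubov'")
most_frequent = counts.most_common(3)
```

Over a large sample, the most frequent patterns would be the candidates for lexical constructions. A fuller version would also mix lemma and lexical-semantic tags within one pattern, as the examples in the abstract do.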
In this paper we compare the Russian National Corpus to a larger Russian web corpus compiled in 2014; the assumption behind our work is that the National Corpus, being limited by the texts it contains and their proportions, presents lexical contexts (and thus meanings) which are different from those found ‘in the wild’, i.e. in the language in use.
To make this comparison, we used both corpora as training sets to learn vector word representations and found the nearest neighbors, or associates, for all top-frequency nominal lexical units. Then the difference between these two neighbor sets for each word was calculated using the Jaccard similarity coefficient. The resulting value measures how far the meaning of a given word in the language of web pages differs from its meaning in the Russian of the National Corpus: the lower the coefficient, the greater the difference. About 15% of words were found to acquire completely new neighbors in the web corpus.
In this paper, the research methodology is described and implications for the Russian National Corpus are proposed. All experimental data are available online.
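The comparison step can be sketched as follows, assuming the nearest neighbors have already been obtained from two separately trained models. The function names and the toy neighbor lists are hypothetical and only illustrate how the Jaccard coefficient is applied to two neighbor sets.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


def neighbour_overlap(neighbours_a, neighbours_b):
    """Jaccard coefficient per word for the words present in both models."""
    shared = neighbours_a.keys() & neighbours_b.keys()
    return {w: jaccard(neighbours_a[w], neighbours_b[w]) for w in shared}


# Toy, invented neighbor lists: not the paper's data.
rnc = {"word1": ["a", "b", "c"], "word2": ["x", "y"]}
web = {"word1": ["b", "c", "d"], "word2": ["p", "q"]}
overlap = neighbour_overlap(rnc, web)
new_neighbours = sorted(w for w, score in overlap.items() if score == 0.0)
```

A coefficient of 0 corresponds to a word whose neighbors in the second corpus are completely new, which is the case reported for about 15% of words.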
This volume contains contributions related to the accentology of the Baltic and Slavic languages. Some of these deal with the accentual properties of Baltic and Slavic languages or dialects, others discuss the historical development of these accentual systems. The volume also contains papers on similar accentual systems and developments in other languages, such as Abkhaz and the Mordvinian languages. The majority of the contributions were presented at the Third International Workshop on Balto-Slavic Accentology (IWoBA), which was held at Leiden University from 27 till 29 July 2007.
The paper focuses on the reaction of Italian literary critics to the publication of Boris Pasternak's novel "Doctor Zhivago". An analysis of the book ""Doctor Zhivago", Pasternak, 1958, Italy" (published in Russian by "Reka vremen", Moscow, 2012) is given. The papers of Italian writers, critics and literary historians who reacted immediately to the publication of the novel (A. Moravia, I. Calvino, F. Fortini, C. Cassola, C. Salinari, etc.) are studied and analysed.
The article describes the patterns by which emotional utterances are realized in dialogic and monologic speech. The author pays special attention to the characteristic features of the speech of a speaker experiencing psychological tension and to the compositional and pragmatic peculiarities of dialogic and monologic text.