Book
Quantitative approaches to the Russian language
This edited collection presents a range of methods that can be used to analyse linguistic data quantitatively. A series of case studies of Russian data spanning different aspects of modern linguistics serve as the basis for a discussion of methodological and theoretical issues in linguistic data analysis. The book presents current trends in quantitative linguistics, evaluates methods and presents the advantages and disadvantages of each. The chapters contain introductions to the methods and relevant references for further reading.
The Russian language, despite being one of the most studied in the world, until recently has been little explored quantitatively. After a burst of research activity in the years 1960–1980, quantitative studies of Russian vanished. They are now reappearing in an entirely different context. Today, we have large and deeply annotated corpora available for extended quantitative research, such as the Russian National Corpus, ruWac, ruTenTen, to name just a few (websites for these and other resources will be found in a special section in the References). The present volume is intended to fill the lacuna between the available data and the methods that can be applied to studying them.
Our goal is to present current trends in researching Russian quantitative linguistics, to evaluate the research methods vis-à-vis Russian data, and to show both the advantages and the disadvantages of the methods. We especially encouraged our authors to focus on evaluating statistical methods and new models of analysis. New findings concern applicability, evaluation, and the challenges that arise from using quantitative approaches to Russian data. The goal of this volume is therefore twofold: a) to address the topic of quantitative analysis of the Russian language, and b) to present an evaluation of methods applied to Russian data.
Russian has a relatively large group of biaspectual verbs, which can be used to convey both perfective and imperfective meaning. However, some of these verbs are used more often in perfective contexts and others in imperfective contexts, which is likely to influence the direction of the further development of overt aspectual oppositions in these verbs (such as whether a biaspectual verb will acquire a prefixed perfective aspectual partner or a suffixed imperfective partner). In this paper, I propose three methods for determining the status of a biaspectual verb, namely, by estimating the relative frequencies of its perfective and imperfective gerunds, by classifying its grammatical profile (i.e. frequencies of major Tense/Aspect/Mood categories) using the k Nearest Neighbors algorithm and by running an experiment on the perception of the inherent aspect of biaspectual verb forms. The study shows that Russian biaspectual verbs are gradually becoming more common in imperfective contexts. The classification based on the grammatical profiles of the verbs yields results that are quite close to the context-free perception of the aspect of biaspectual verb forms by Russian speakers. The data also show that Russian biaspectual verbs are quite dissimilar: some of them resemble imperfective verbs, while others behave more like perfective verbs, and between the two poles there is still a large group of truly biaspectual verbs.
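As a rough illustration of the second method, the following minimal sketch classifies a verb's grammatical profile with k nearest neighbors. Everything in it is an assumption made for illustration: the four profile dimensions, the training figures, the Euclidean distance, and the function name `knn_classify` are hypothetical and are not taken from the study.

```python
from collections import Counter
from math import dist

# Hypothetical grammatical profiles: relative frequencies of
# (past, non-past, infinitive, imperative) forms. Illustrative numbers only,
# not data from the study.
TRAINING = [
    ((0.60, 0.10, 0.25, 0.05), "perfective"),
    ((0.55, 0.15, 0.25, 0.05), "perfective"),
    ((0.65, 0.08, 0.22, 0.05), "perfective"),
    ((0.30, 0.45, 0.22, 0.03), "imperfective"),
    ((0.25, 0.50, 0.20, 0.05), "imperfective"),
    ((0.35, 0.40, 0.21, 0.04), "imperfective"),
]


def knn_classify(profile, training=TRAINING, k=3):
    """Return the majority aspect label among the k nearest training profiles."""
    # Sort labeled profiles by Euclidean distance to the query profile.
    neighbors = sorted(training, key=lambda item: dist(item[0], profile))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# A biaspectual verb whose profile is dominated by past-tense forms
# ends up closer to the perfective training verbs.
print(knn_classify((0.58, 0.12, 0.25, 0.05)))
```

Under these assumptions, a biaspectual verb is assigned the aspect of the unambiguous verbs whose profiles it most resembles. A verb lying between the two groups would receive a split vote, which is one way to picture the "truly biaspectual" group.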
The domain of modality is structurally diverse and may be described in multiple ways (for example, see Perkins, 1983; Wierzbicka, 1987; Hengeveld, 1988/2004; Sweetser, 1990; Bondarko, 1990; Bybee et al., 1994; van der Auwera and Plungian, 1998; Palmer, 2001; Hansen, 2004; Nuyts, 2006; Khrakovsky, 2007). The article reports on the Russian part of a larger survey of Slavic modal words and elucidates the role of the formal and semantic context of modal words in a new way. The availability of large corpus data paves the way for studying the empirical reliability of existing classifications originally proposed by philosophers. An important property of modal words is that they are largely ambiguous, developing new modal meanings both diachronically and synchronically.
The chapter demonstrates how quantitative corpus methods used in linguistic research may help to rank different realizations of the same phenomenon: the use of dative subjects in predicative and adjective constructions. The core idea of the research is to study the distribution of dative subject constructions with predicative and adjective forms that can potentially be used in such constructions, i.e., the tendency of each construction to express or omit the dative subject. While predicates are usually classified on the basis of whether they can potentially be used with a dative subject, the author studied the trends for explicit use of the dative (or prepositional beneficiary arguments) among the “dative subject predicates.” The chapter shows that the actual frequency of dative subjects can differ greatly from one predicate to another. Finally, data from the eighteenth and twenty-first centuries are compared, and hierarchical clustering is used to reveal diachronic trends.
According to G. K. Zipf’s observation, there is a strong correlation between word frequency and polysemy. Yet word sense frequency distribution is a neglected area in computational linguistics, even though the study of sense frequency has both theoretical interest and practical applications for lexicography and word sense disambiguation. Although WordNet and SemCor contain some information about sense frequency in English, it is not enough for either practical or research purposes. For Russian, such information is lacking altogether. To fill this lacuna, we developed and tested an automated system based on semantic vectors, which deals with the problem of sense frequency for Russian nouns. The model is first trained unsupervised on large corpora and then supplied with contexts and collocations from the Active Dictionary of Russian. The dictionary examples are used either for supervised post-training or for automatic labeling of clusters that are learned unsupervised. This allows us to reach a frequency estimation error of 11–15 percent on different corpora without additional labeled data. Word sense frequency distributions for 440 nouns are available online.
This paper focuses on empirical collocations, understood here as word co-occurrences that 1) are frequent enough to be extracted automatically and 2) may be semantically and/or syntactically bound to various extents. Our main goal is to examine closely five window-based methods for empirical collocation extraction that are widely used in corpus-based studies, sometimes without proven efficiency. Our study evaluates the methods’ reliability for Russian data by testing two hypotheses: a) collocations listed in a professionally compiled dictionary (i.e., those considered fixed to some extent by experts in the field) should have higher rankings in automatically extracted lists of collocations, and b) collocations considered fixed expressions by native speakers should have higher rankings in automatically generated lists. Our research indicates that raw frequency, t-score, log-likelihood, and Dice give the best rankings, while MI and wFR demonstrate poorer results in both evaluations. In general, all of these evaluations, although each has its own limitations, lead to comparable results, which should be taken into account in future research.
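For readers unfamiliar with these measures, the sketch below computes the textbook forms of raw frequency, MI, t-score, Dice, and log-likelihood from co-occurrence counts. It is an illustration under stated assumptions, not the paper's implementation: the function name `association_scores` and the counts in the usage line are hypothetical, the counts are assumed to come from some fixed co-occurrence window, and wFR is omitted because its definition is not given here.

```python
from math import log, log2, sqrt


def association_scores(o11, f1, f2, n):
    """Score a word pair from co-occurrence counts.

    o11: co-occurrences of the pair (must be > 0); f1, f2: frequencies of
    each word; n: total number of counted positions. All inputs are
    hypothetical counts supplied by the caller.
    """
    e11 = f1 * f2 / n  # expected co-occurrences under independence
    # Observed and expected cells of the 2x2 contingency table.
    observed = [o11, f1 - o11, f2 - o11, n - f1 - f2 + o11]
    expected = [
        e11,
        f1 * (n - f2) / n,
        (n - f1) * f2 / n,
        (n - f1) * (n - f2) / n,
    ]
    # Log-likelihood ratio: 2 * sum of O * ln(O / E), skipping empty cells.
    log_likelihood = 2 * sum(
        o * log(o / e) for o, e in zip(observed, expected) if o > 0
    )
    return {
        "frequency": o11,
        "mi": log2(o11 / e11),
        "t_score": (o11 - e11) / sqrt(o11),
        "dice": 2 * o11 / (f1 + f2),
        "log_likelihood": log_likelihood,
    }


# Hypothetical counts: the pair occurs 30 times, the words 100 and 200 times.
print(association_scores(30, 100, 200, 10_000))
```

Ranking candidate pairs by each score in turn yields the competing collocation lists that can then be compared against a dictionary or against native-speaker judgments.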
Abstract: The introductory chapter presents current trends in researching the Russian language quantitatively. It starts with a short description of the main features of Russian grammar to help the reader follow this book without deep knowledge of the language. The main part gives an overview of the quantitative studies of Russian conducted in the 2000s and 2010s. We first address the concept of the linguistic profile, which has been explored largely using Russian data and which makes a significant contribution to modern linguistics. Second, we review some basic statistical tests before turning to more elaborate multivariate models. The chapter concludes with a comprehensive list of resources and tools available to researchers, and an extended list of references for further reading.

In this article we report some new experiments in the area of word clustering for the Russian language. We introduce a new clustering method that distributes words into classes according to their syntactic relations. We used a large untagged corpus (about 7.2 billion words) to collect a set of such relations. The corpus was processed using a set of finite state automata that extract syntactically dependent combinations with explicit structure. The automata were applied only to unambiguous text fragments, because this combination of techniques increases the quality of the sampled input data. A modification of group average agglomerative clustering was used to distribute words among clusters. The resulting set of clusters was tested against one of the semantic dictionaries of the Russian language. The NMI score calculated in this article is 0.457, and the F1-score is 0.607.
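To make the evaluation metric concrete, here is a minimal sketch of normalized mutual information (NMI) between a predicted clustering and gold classes. The article does not say which normalization it uses, so the arithmetic mean of the two entropies is an assumption here. The function names and the toy labels are also hypothetical.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (in nats) of a list of cluster or class labels."""
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def nmi(predicted, gold):
    """Mutual information of two labelings, normalized by their mean entropy."""
    n = len(gold)
    joint = Counter(zip(predicted, gold))
    p_count, g_count = Counter(predicted), Counter(gold)
    mutual = sum(
        c / n * log(c * n / (p_count[p] * g_count[g]))
        for (p, g), c in joint.items()
    )
    # Assumed normalization: arithmetic mean of the two entropies.
    denom = (entropy(predicted) + entropy(gold)) / 2
    # Convention chosen here: two single-cluster labelings count as identical.
    return mutual / denom if denom > 0 else 1.0


# Toy example: four words, predicted cluster ids vs. hypothetical gold classes.
print(nmi([0, 0, 1, 1], ["animal", "animal", "tool", "tool"]))
```

A score near 1 means the clusters reproduce the dictionary classes almost exactly, while a score near 0 means the two partitions are statistically unrelated.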
«Bankruptcy» Concept Within the Legal Linguistics Coordinates: Russian–English–French Approximations
The article addresses the notion of bankruptcy as perceived by speakers of contemporary Russian, English and French, both lawyers and participants in professional communication from other trades. The semantic structure of the term is identified on the basis of its lexicographic and regulatory definitions.
These proceedings include papers on subjects from a wide number of areas including theoretical linguistics, translation, computational linguistics, natural language processing, and applied linguistics, focusing on a variety of languages, ranging from familiar Indo-European languages to Mandarin Chinese, Wolof, and Dene Sųɬiné. In order to make the papers available to the wider research community, these proceedings are being published electronically and distributed freely at http://www.meaningtext.net
Pleonastic Constructions In English Legal Texts
Quite a number of English legal texts, largely from contract law, provide linguistic evidence of terminology and/or commonly used vocabulary with semantically identical or related meanings being used together within the same text sequences. Such constructions are challenging to classify taxonomically for linguists and lawyers alike. An analysis of examples allows such usage to be attributed to pleonastic constructions typical of legal language.
This paper deals with the Semantics/Pragmatics distinction in a contrastive ethnolinguistic aspect. I argue for the validity of this distinction based on cross-linguistic data. My claim is that the specificity of the so-called language key words [Wierzbicka 1990:15-17] (linguospecific items particularly representative of a given language speakers' mentality) is due to pragmatic rather than semantic peculiarities. These pragmatic peculiarities distinguish the key words both from their synonyms within the same language and their counterparts in other languages. The languages under discussion are Russian and English, analyzed within a combined frame of Integral Language Description model [Apresjan 1995:8-238] and Wierzbicka's ethnolinguistic approach.
This paper presents an analysis of forms of address used in reference to an unknown recipient in everyday communication. In describing how particular forms of address function, the author relies on the opinions of renowned experts in the field of speech etiquette and the culture of the Russian language, on the author's own linguistic observations, and on data from a survey conducted in the fall of 2010 among residents of the capital aged 20-50.
We consider certain spaces of functions on the circle, which naturally appear in harmonic analysis, and superposition operators on these spaces. We study the following question: which functions have the property that each of their superpositions with a homeomorphism of the circle belongs to a given space? We also study the multidimensional case.
We consider the spaces of functions on the m-dimensional torus whose Fourier transform is p-summable. We obtain estimates for the norms of the exponential functions deformed by a C^1-smooth phase. The results generalize to the multidimensional case the one-dimensional results obtained by the author earlier in “Quantitative estimates in the Beurling–Helson theorem”, Sbornik: Mathematics, 201:12 (2010), 1811–1836.
We consider the spaces of functions on the circle whose Fourier transform is p-summable. We obtain estimates for the norms of exponential functions deformed by a C^1-smooth phase.