Quantitative approaches to the Russian language
This paper focuses on empirical collocations, understood here as word co-occurrences that 1) are frequent enough to be extracted automatically and 2) may be semantically and/or syntactically bound to varying degrees. Our main goal is to examine closely five window-based methods for empirical collocation extraction that are widely used in corpus-based studies, sometimes without proven effectiveness. Our study evaluates the methods’ reliability for Russian data by testing two hypotheses: a) collocations listed in a professionally compiled dictionary (i.e., those considered fixed to some extent by experts in the field) should rank higher in automatically extracted lists of collocations, and b) collocations considered fixed expressions by native speakers should rank higher in automatically generated lists. Our research indicates that raw frequency, t-score, log-likelihood, and Dice give the best rankings, while MI and wFR demonstrate poorer results in both evaluations. In general, both evaluations, although each has its own limitations, lead to comparable results, which should be taken into account in future research.
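For readers unfamiliar with the association measures named above, the following is a minimal sketch of how four of them (t-score, MI, Dice, and log-likelihood) are commonly computed from corpus counts. It uses the textbook definitions, which may differ in detail from the formulations used in the paper. The function name `association_scores` and its parameters are illustrative assumptions, and wFR is left out because the abstract does not define it.

```python
import math


def association_scores(f_xy, f_x, f_y, n):
    """Textbook association measures for a word pair (x, y).

    Assumed inputs (hypothetical names, not taken from the paper):
      f_xy -- number of times x and y co-occur within the chosen window
      f_x  -- corpus frequency of x
      f_y  -- corpus frequency of y
      n    -- corpus size in tokens
    """
    # Co-occurrence count expected if x and y were independent.
    expected = f_x * f_y / n

    # t-score: observed minus expected, scaled by sqrt(observed).
    t_score = (f_xy - expected) / math.sqrt(f_xy)

    # Mutual information (pointwise), base-2 logarithm.
    mi = math.log2(f_xy / expected)

    # Dice coefficient.
    dice = 2 * f_xy / (f_x + f_y)

    # Log-likelihood ratio over the 2x2 contingency table.
    observed_cells = [
        f_xy,                    # x and y
        f_x - f_xy,              # x without y
        f_y - f_xy,              # y without x
        n - f_x - f_y + f_xy,    # neither x nor y
    ]
    rows = [f_x, n - f_x]
    cols = [f_y, n - f_y]
    expected_cells = [
        rows[0] * cols[0] / n,
        rows[0] * cols[1] / n,
        rows[1] * cols[0] / n,
        rows[1] * cols[1] / n,
    ]
    log_likelihood = 2 * sum(
        o * math.log(o / e)
        for o, e in zip(observed_cells, expected_cells)
        if o > 0  # empty cells contribute nothing
    )

    return {
        "frequency": f_xy,
        "t_score": t_score,
        "mi": mi,
        "dice": dice,
        "log_likelihood": log_likelihood,
    }
```

Ranking candidate pairs by any one of these values produces the kind of automatically extracted list that the two hypotheses are tested against. Raw frequency is included in the result only for comparison, since it requires no computation beyond counting.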
Abstract: The introductory chapter presents current trends in quantitative research on the Russian language. It starts with a short description of the main features of Russian grammar to help the reader follow this book without deep knowledge of the language. The main part surveys quantitative studies of Russian conducted in the 2000s and 2010s. We first address the concept of the linguistic profile, which has been explored largely using Russian data and which makes a significant contribution to modern linguistics. Second, we review some basic statistical tests before turning to more elaborate multivariate models. The chapter concludes with a comprehensive list of resources and tools available to researchers, and an extended list of references for further reading.