The Russian Language on the Verge of a Nervous Breakdown. 3D
We present the ShiftRy web service, which helps analyze temporal changes in word usage in news texts from Russian mass media. To this end, we employ diachronic word embedding models trained on large Russian news corpora from 2010 to 2019. Users can explore the usage history of any query word or browse lists of words ranked by the degree of their semantic drift between any pair of years. Visualizations of the words’ trajectories through time are provided. Importantly, users can obtain corpus examples with the query word before and after the semantic shift (if any). The aim of ShiftRy is to ease the task of studying word history over short time spans, as well as the influence of social and political events on word usage. The service will be updated with new data yearly.
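The idea of ranking words by semantic drift between two years can be illustrated with a minimal sketch. It assumes that each yearly model maps a word to a vector, that the two vector spaces are already aligned, and that drift is measured as cosine distance; the abstract does not say which measure ShiftRy actually uses. The function names and the toy vectors are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def rank_by_drift(vectors_a, vectors_b):
    """Rank the words shared by two yearly models, most drifted first.

    Assumption: both models live in the same (aligned) vector space.
    """
    shared = vectors_a.keys() & vectors_b.keys()
    scores = {w: cosine_distance(vectors_a[w], vectors_b[w]) for w in shared}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy two-dimensional vectors, made up purely for illustration.
earlier_year = {"word_a": [1.0, 0.0], "word_b": [1.0, 0.0]}
later_year = {"word_a": [1.0, 0.0], "word_b": [0.0, 1.0]}

ranking = rank_by_drift(earlier_year, later_year)
print(ranking)
```

In this toy setup `word_b` changes direction completely and is ranked first, while `word_a` does not move at all.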
We analyze the dynamics of dialect loss in a cluster of villages in rural northern Russia based on a corpus of transcribed interviews, the Ustja River Basin Corpus. Eleven phonological and morphological variables are analyzed across 33 speakers born between 1922 and 1996 in a series of logistic regression models. We propose three characteristics for a comparison of the rate of loss of different variables: initial level, steepness, and turning point. We show that the dynamics of loss differs significantly across variables and discuss possible reasons for such differences, including perceptual salience, initial variation in the dialect, and convergence with regionally or socially defined varieties of Russian. In conclusion, we discuss the pros and cons of logistic regression as an approach to quantitative modelling of dialect loss. Our paper contributes to the study and documentation of Russian dialects, most of which are on the verge of extinction.
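One possible reading of the three characteristics can be sketched for a single variable modelled as a logistic function of the speaker's birth year. The mapping used here is an assumption, not the paper's stated definition: initial level is taken as the predicted rate for the earliest birth year, steepness as the slope coefficient, and turning point as the birth year at which the predicted rate crosses 0.5. The coefficient values are invented for illustration and are not results from the corpus.

```python
import math


def dialect_rate(birth_year, intercept, slope):
    """Predicted probability of the dialect variant for a given birth year."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * birth_year)))


def curve_characteristics(intercept, slope, first_year):
    """Summarize a fitted logistic curve (assumed definitions, see lead-in)."""
    return {
        "initial_level": dialect_rate(first_year, intercept, slope),
        "steepness": slope,
        # The logit is zero, i.e. the rate is 0.5, at -intercept / slope.
        "turning_point": -intercept / slope,
    }


# Hypothetical coefficients: the rate falls with later birth years
# and crosses 0.5 for speakers born around 1960.
summary = curve_characteristics(intercept=196.0, slope=-0.1, first_year=1922)
print(summary)
```

Comparing such summaries across variables is one way to express that one feature is lost earlier or more abruptly than another.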
The paper presents the results of GramEval 2020, a shared task on Russian morphological and syntactic processing. The objective is to process Russian texts, starting from provided tokens, into parts of speech (POS), grammatical features, lemmas, and labeled dependency trees. To encourage multi-domain processing, five genres of Modern Russian are selected as test data: news, social media and electronic communication, wiki-texts, fiction, and poetry; Middle Russian texts are used as the sixth test set. The data annotation follows the Universal Dependencies scheme. Unlike in many similar tasks, the training data is a collection of existing resources whose annotation is not perfectly harmonized, so the variability in annotations is a further source of difficulty. The main metric is the average of the accuracy of POS, feature, and lemma tagging and the labeled attachment score (LAS). In this report, the organizers of GramEval 2020 give an overview of the task, training and test data, evaluation methodology, submission routine, and participating systems. The approaches proposed by the participating systems and their results are reported and analyzed.
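A metric of this kind can be sketched as follows, assuming an unweighted mean of four token-level scores computed over identical gold and predicted tokenization. The function name is hypothetical, the dictionary keys borrow CoNLL-U column names, and details of the official scorer (for example, lemma normalization) are not given in the abstract and are not modelled here.

```python
def average_quality(gold, pred):
    """Mean of POS, feature, and lemma accuracy and LAS over aligned tokens."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and equally long")
    n = len(gold)

    def share(match):
        # Fraction of tokens for which the predicate holds.
        return sum(1 for g, p in zip(gold, pred) if match(g, p)) / n

    scores = {
        "pos": share(lambda g, p: g["upos"] == p["upos"]),
        "feats": share(lambda g, p: g["feats"] == p["feats"]),
        "lemma": share(lambda g, p: g["lemma"] == p["lemma"]),
        # LAS: both the head index and the dependency label must match.
        "las": share(lambda g, p: g["head"] == p["head"]
                     and g["deprel"] == p["deprel"]),
    }
    scores["average"] = sum(scores.values()) / 4
    return scores


# Two-token toy sentence, made up for illustration.
gold = [
    {"upos": "NOUN", "feats": "Case=Nom|Number=Sing", "lemma": "кот",
     "head": 2, "deprel": "nsubj"},
    {"upos": "VERB", "feats": "Tense=Pres", "lemma": "спать",
     "head": 0, "deprel": "root"},
]
pred = [
    dict(gold[0], deprel="obj"),   # wrong dependency label
    dict(gold[1], lemma="спит"),   # wrong lemma
]

result = average_quality(gold, pred)
print(result)
```

Here POS and feature accuracy are 1.0, lemma accuracy and LAS are 0.5 each, and the average is 0.75.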
The paper presents the first results of pilot fieldwork on the Russian language of one group of East Siberian old-settlers, in the context of their ethnic and cultural history and their role in the Russian expansion eastward, including to Alaska, in the 18th-19th centuries. On the one hand, regional features of the old-settlers’ Russian testify to the cultural and historical processes that involved various groups of the Russian-speaking population of East Siberia. On the other hand, these linguistic materials are compared with data on the Russian language in Alaska, which should help to clarify the processes that shaped the Russian linguistic and cultural heritage of the only overseas Russian region.
The book includes 64 papers submitted to the International Conference on Computational Linguistics and Intellectual Technologies Dialogue 2019 and presents a broad spectrum of theoretical and applied research on natural language description, language simulation, and the creation of applied computer technologies.