State Languages of Russia in Wikipedia: On the Online Activity of Minority Language Communities
On Wikipedia in the Languages of Russia
In this article we report new experiments in word clustering for the Russian language. We introduce a clustering method that distributes words into classes according to their syntactic relations. We used a large untagged corpus (about 7.2 billion words) to collect a set of such relations. The corpus was processed using a set of finite-state automata that extract syntactically dependent combinations with explicit structure. These automata were applied only to unambiguous text fragments, because combining the two techniques improves the quality of the sampled input data. A modification of group-average agglomerative clustering was used to distribute words among clusters. The resulting set of clusters was evaluated against one of the semantic dictionaries of the Russian language. The NMI score obtained is 0.457 and the F1-score is 0.607.
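As an illustration of the evaluation measure named above, here is a minimal sketch of normalized mutual information (NMI) between a predicted clustering and a reference classification. The abstract does not say which normalization was used; the sketch assumes the common variant 2·I(U;V) / (H(U) + H(V)). The function names and toy labels are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a labelling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    """Mutual information between two labellings of the same items."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    return sum(
        (n_xy / n) * math.log(n_xy * n / (count_a[x] * count_b[y]))
        for (x, y), n_xy in joint.items()
    )

def nmi(a, b):
    """NMI with arithmetic-mean normalization (an assumption, see lead-in)."""
    h_a, h_b = entropy(a), entropy(b)
    if h_a + h_b == 0:
        return 1.0  # both labellings put every item in a single class
    return 2 * mutual_information(a, b) / (h_a + h_b)

# Toy labels, not data from the paper: cluster ids vs. dictionary classes.
predicted = [0, 0, 1, 1, 2, 2]
reference = ["animal", "animal", "tool", "tool", "tool", "place"]
print(round(nmi(predicted, reference), 3))
```

The score is 1 when the two partitions coincide up to relabelling and 0 when they are statistically independent.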
The aim of the present paper is to investigate Russian compounds formed with agentive suffixes from a cognitive perspective. These compounds, in which two lexical roots are followed by an agentive suffix (including the zero agentive suffix), have been analyzed based on the metonymical shifts underlying their formation, following the methodology proposed by Janda (2011) for suffixal word-formation. We compare the behavior of agentive suffixes in compounding and in suffixation (cf. Janda 2011), shedding some light on the similarities and differences between the two word-formation processes. We also employ the cognitive tools of metonymy and metaphor to identify other significant shifts occurring at the lexical level and concerning the source element of the compounds.
Changes in modern Russian driven by the spread of new technologies; the Russian of the Internet (Runet); social and cultural consequences of the CMC (computer-mediated communication) revolution.
The paper compares two rival word-formation constructions giving rise to compound agent nouns in Russian, i.e., (para)synthetic compounds formed with the agentive suffixes -ec and -tel’, such as basnopisec ‘fable writer’ and bytopisatel’ ‘everyday-life writer’. To understand what makes these constructions different from one another, compounds in -ec and -tel’ are analyzed based on a number of formal and semantic criteria, i.e., the part of speech and semantic role of the non-verbal element of the compound, the transitivity and formal aspect of the verbal base of the compound, the animacy of the compound’s referent, and the semantics of the compound. The study is supported by statistical analyses, i.e., conditional inference trees and random forests, which help discriminate the behavior of rival constructions and determine which parameters are more relevant for the comparison. To understand whether diachronic and/or stylistic factors also affect the survival of rival constructions, the data are checked in the Russian National Corpus, which allows retrieving information about the texts in which compounds occur, such as their creation date and textual genre. Finally, the productivity of rival word-formation constructions in modern Russian is discussed both in terms of diachronic changes and in terms of restrictions that the two constructions are subject to. The analyses carried out demonstrate that the two constructions show significant differences regarding their semantics, but also their diachronic and stylistic distribution, as well as their productivity, which prevents one construction from completely ousting the other in modern Russian.
This volume is the third issue of a corpus-based grammar of Russian. It deals with parts of speech and, more generally, with formal classes of the lexicon. It comprises descriptive papers on individual parts of speech and lesser word classes.
We ask whether the aspect of individual verbs can be predicted based on the statistical distribution of their inflectional forms and how this is influenced by genre. To address these questions, we present an analysis of the “grammatical profiles” (relative frequency distributions of inflectional forms) of three samples of verbs extracted from the Russian National Corpus, representing three genres: Journalistic prose, Fiction, and Scientific-Technical prose. We find that the aspect of a given verb can be correctly predicted from the distribution of its forms alone with an average accuracy of 92.7%. Remarkably, this accuracy is statistically indistinguishable from the accuracy of prediction of aspect based on morphological marking. We maintain that it would be possible for first language learners to use distributional tendencies, in addition to morphological and other cues (for example semantic and syntactic cues), in acquiring the verbal category of aspect in Russian.
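A "grammatical profile" as defined above can be sketched in a few lines: count the inflectional forms attested for a verb and divide by the total. The tag names and the toy counts are hypothetical and are not drawn from the Russian National Corpus; the paper's actual tag set and prediction model are not specified here.

```python
from collections import Counter

def grammatical_profile(form_tags):
    """Relative frequency of each inflectional form among a verb's tokens."""
    counts = Counter(form_tags)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}

# Hypothetical tagged occurrences of one verb (illustrative only).
occurrences = ["past", "past", "infinitive", "imperative"]
profile = grammatical_profile(occurrences)
print(profile)  # {'past': 0.5, 'infinitive': 0.25, 'imperative': 0.25}
```

Profiles of this kind could then serve as feature vectors for any standard classifier that predicts aspect.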
In Russian negative sentences the verb’s direct object may appear either in the Accusative case, which is licensed by the verb (as is common cross-linguistically), or in the Genitive case, which is licensed by the negation (the Russian-specific ‘Genitive-of-Negation’ phenomenon). Such sentences were used to investigate whether case marking is employed for anticipating syntactic structure, and whether lexical heads other than the verb can be predicted on the basis of a case-marked noun phrase. Experiment 1, a completion task, confirmed that Genitive-of-Negation is part of Russian speakers’ active grammatical repertoire. In Experiments 2 and 3, the Genitive/Accusative case manipulation on the preverbal object led to shorter reading times at the negation and verb in the Genitive vs. Accusative condition. Furthermore, Experiment 3 manipulated the linear order of the direct object and the negated verb in order to determine whether the above-mentioned facilitatory effect was predictive or integrative in nature, and concluded that the parser actively predicts a verb and (otherwise optional) negation on the basis of a preceding genitive-marked object. As in a head-final language, case-marking information on preverbal NPs is used by the parser to enable incremental structure building in a free-word-order language such as Russian.
The present paper aims at investigating the productivity of the prefixoid samo- (‘self’) in Russian compounds from a diachronic perspective. In order to verify the hypothesis that the productivity of this prefixoid has grown over time, I consider the occurrences of samo-compounds in the Russian National Corpus, dividing the main corpus into four subcorpora, each one representing a particular time span: the 18th century, the 19th century, the 20th century and the period that lasts from the beginning of the 21st century to the present day. The approach chosen is quantitative in nature, and is based on the measure of “potential productivity” (Baayen & Lieber 1991; Baayen 1992, 1993), which is calculated by dividing the number of hapax legomena with a certain affix by the number of tokens with that affix. This measure, however, seems inadequate for the comparison of differently-sized corpora. To overcome this problem, I resort to parametric statistical models of frequency distribution known as LNRE (Large Number of Rare Events) models (Baayen 2001). These models, which allow extrapolating the expected values of types and hapax legomena with a given affix for arbitrary values of tokens, are implemented in the package zipfR (Baroni & Evert 2014), a tool for lexical statistics in R, which is used for this study.
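The "potential productivity" measure described above (number of hapax legomena with an affix divided by the number of tokens with that affix) can be sketched as follows. The function name and the toy token list are hypothetical; the study itself relies on zipfR in R, and this sketch does not reproduce the LNRE extrapolation.

```python
from collections import Counter

def potential_productivity(tokens):
    """Return P = V1 / N: hapax legomena count divided by token count."""
    counts = Counter(tokens)
    n_tokens = sum(counts.values())
    if n_tokens == 0:
        raise ValueError("empty sample")
    n_hapax = sum(1 for c in counts.values() if c == 1)
    return n_hapax / n_tokens

# Hypothetical toy sample of samo- tokens (not corpus data).
sample = ["samolet", "samolet", "samovar", "samoocenka", "samolet", "samokritika"]
print(potential_productivity(sample))  # 0.5 (3 hapaxes / 6 tokens)
```

Because the value depends on the token count N, scores computed on subcorpora of different sizes are not directly comparable, which is the motivation for the LNRE models mentioned in the abstract.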