State Languages of Russia in Wikipedia: On the Online Activity of Minority Language Communities
About Wikipedia in the Languages of Russia
Authorship attribution is an important field in online security. Numerous recent studies have reported successful authorship attribution in various European languages. Character n-grams are reported to be the best choice for authorship attribution, as they encode both style and content information. We evaluate different types of character n-gram features in an authorship attribution task on a real-world noisy dataset of Russian forum posts. We also supplement them with a number of new simple n-gram features capturing syntactic and discourse patterns. We perform authorship attribution in a single-topic and a cross-topic setting, as the research question is whether character n-grams capture both style and content information. Our results show that character n-grams are indeed very successful in authorship attribution of Russian forum posts. However, there is no clear distinction between style and content n-grams, as the same types of n-grams work well in both single-topic and cross-topic settings. In our experiments, the generalized simple n-gram features, which reveal syntactic and discourse patterns, also proved very important in authorship attribution of short informal Russian texts. They represent a different kind of authorship information and are a successful addition to character n-grams in authorship attribution of forum texts in Russian.
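As a rough illustration of what a character n-gram feature is, the sketch below counts overlapping character n-grams in a short text. The function name `char_ngrams`, the choice of n = 3, and the sample string are assumptions made for illustration only; the abstract does not specify the feature types or preprocessing actually used in the study.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in text (hypothetical helper)."""
    if n <= 0:
        raise ValueError("n must be positive")
    # Slide a window of width n over the text. Spaces and punctuation are
    # kept, since they may carry stylistic information.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


# Illustrative input only, not taken from the forum dataset.
profile = char_ngrams("ну вот, ну вот", n=3)
```

Counts of this kind can be turned into a feature vector per text; a text shorter than n characters simply yields an empty profile.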
The volume is the third issue of a corpus-based grammar of Russian. It deals with parts of speech and, more generally, with formal classes of the lexicon. It comprises descriptive papers on individual parts of speech and lesser word classes.
In response to the growing demand for highly proficient foreign language (L2) speakers in professional work settings, scholars and educators have increasingly turned their attention to methods for developing greater fluency in their learners who aspire to such jobs. Engaging in persuasive writing and argumentation has been shown to promote both written and oral proficiency among advanced L2 learners (Brown, 2009). This study focuses on the application of the American Council on the Teaching of Foreign Languages (ACTFL) proficiency guidelines and standards to the design of teletandem courses in English as a Foreign Language (EFL) and Russian as a Foreign Language developed to promote Advanced and Superior-level language gains. ACTFL Can-Do statements were used to evaluate learners’ self-reported language gains as a result of participating in the course. The results indicated that such an approach can indeed yield significant perceived gains, especially for spoken language, for all the participants regardless of their target language and home institution.
The book includes 64 papers submitted to the International Conference on Computational Linguistics and Intellectual Technologies "Dialogue 2019" and presents a broad spectrum of theoretical and applied research on natural language description, language simulation, and the creation of applied computer technologies.
Paralinguistic phenomena are non-verbal elements in conversation. Paralinguistic studies are usually based on audio or video recordings of spoken communication. In this article, we show what kind of audible paralinguistic information may be obtained from the ORD speech corpus of everyday Russian discourse, which contains long-term audio recordings of conversations made in natural circumstances. This linguistic resource provides rich authentic data for studying the diversity of audible paralinguistic phenomena. The frequency of paralinguistic phenomena in everyday conversations has been calculated on the basis of an annotated subcorpus of 187,600 tokens. The most frequent paralinguistic phenomena turned out to be laughter, inhalation noise, cough, e-like and m-like vocalizations, tongue clicking, and a variety of unclassified nonverbal sounds (calls, exclamations, imitations by voice, etc.). The paper reports on the distribution of paralinguistic elements, non-verbal interjections, and hesitations in the speech of different gender and age groups.
The paper presents a corpus-driven study of the Russian PP-based degree modifier do uzhasa (lit. ‘to horror’), suggesting a two-stage grammaticalization path. The first stage (presumably, the 18th–19th centuries) involves subjectification, while during the second stage, subjective readings give rise to intensifier readings through conceptual metonymy. Both stages see a host class expansion. This process is motivated by a complex interplay of factors, with analogy playing a major role. Finally, the evolution of do uzhasa is contrasted with that of the English PP-based intensifier to death. While there are obvious similarities, a closer look identifies a number of important differences that are relevant for the development of a construction-based typology of language change.
Recent demands in authorship attribution, specifically cross-topic authorship attribution with small numbers of training samples and very short texts, pose new challenges for corpus design and for feature and algorithm development. In the current work we address these challenges by performing authorship attribution on a specifically designed dataset in Russian. We present a dataset of short written texts in Russian, where both authorship and topic are controlled. We propose a pairwise classification design closely resembling a real-world forensic task. Semantic coherence features are introduced to supplement well-established n-gram features in challenging cross-topic settings. Distance-based measures are compared with machine learning algorithms. The experimental results support the intuition that for very small datasets, distance-based measures perform better than machine learning techniques. Moreover, pairwise classification results show that in difficult cross-topic cases, content-independent features, i.e., part-of-speech n-grams and semantic coherence, are promising. The results are supported by feature significance analysis for the proposed dataset.
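The pairwise, distance-based idea can be sketched as follows, under stated assumptions: two texts are represented as n-gram profiles over part-of-speech tags, and a distance between the profiles is compared with a threshold. Cosine distance, the threshold of 0.5, the bigram order, the helper names, and the hand-written tag sequences are all illustrative choices; the abstract does not say which distance measures, tagger, or decision rule the authors used.

```python
import math
from collections import Counter
from typing import Sequence


def ngram_profile(tokens: Sequence[str], n: int = 2) -> Counter:
    """Count overlapping n-grams over a token sequence (e.g. POS tags)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cosine_distance(a: Counter, b: Counter) -> float:
    """Return 1 - cosine similarity of two count profiles."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty profile shares nothing with any other profile.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def same_author(tags_a: Sequence[str], tags_b: Sequence[str],
                threshold: float = 0.5, n: int = 2) -> bool:
    """Pairwise decision: True if the profiles are closer than threshold.

    The threshold is an arbitrary placeholder, not a tuned value.
    """
    distance = cosine_distance(ngram_profile(tags_a, n), ngram_profile(tags_b, n))
    return distance < threshold


# Hypothetical POS tag sequences, written by hand for illustration.
text_1 = ["PRON", "VERB", "ADJ", "NOUN", "PRON", "VERB", "NOUN"]
text_2 = ["PRON", "VERB", "ADJ", "NOUN", "CONJ", "PRON", "VERB"]
decision = same_author(text_1, text_2)
```

A rule of this kind has no parameters to fit beyond the threshold, which is one plausible reason distance-based measures may hold up on very small datasets.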