Computational Linguistics and Intellectual Technologies: Proceedings of the Annual International Conference "Dialogue" (Moscow, July 1–4, 2016)
In natural language processing, distributional semantic models are known as an efficient data-driven approach to word and text representation, computing meaning directly from large text corpora in the form of word embeddings in a vector space. This paper addresses the role of linguistic preprocessing in enhancing the performance of distributional models, and in particular studies pronominal anaphora resolution as a way to exploit more co-occurrence data without directly increasing the size of the training corpus. We replace three different types of anaphoric pronouns with their antecedents in the training corpus and evaluate the extent to which this affects the performance of the resulting models in lexical similarity tasks. CBOW and SkipGram distributed models trained on the Russian National Corpus are the focus of our research, although the results are potentially applicable to other distributional semantic frameworks and languages as well. The trained models are evaluated against the RUSSE '15 and SimLex-999 gold-standard data sets. As a result, we find that models trained on corpora with pronominal anaphora resolved perform significantly better than their counterparts trained on baseline corpora.
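The preprocessing idea above can be sketched in a few lines (a minimal illustration, not the authors' implementation: the coreference resolver is assumed to be an external component, and all names below are our own):

```python
# Sketch: substitute anaphoric pronouns with their resolved antecedents
# before training a distributional model (CBOW/SkipGram). The coreference
# resolver itself is external; here its output is modeled as a simple
# mapping from token index to antecedent string (hypothetical format).

def resolve_anaphora(tokens, antecedents):
    """Replace pronoun tokens with their antecedent lemmas.

    tokens      -- list of word tokens for one sentence/document
    antecedents -- dict mapping a pronoun's token index to its antecedent
    """
    return [antecedents.get(i, tok) for i, tok in enumerate(tokens)]

tokens = ["Anna", "bought", "a", "book", ";", "she", "read", "it", "quickly"]
antecedents = {5: "Anna", 7: "book"}  # 'she' -> Anna, 'it' -> book
print(resolve_anaphora(tokens, antecedents))
# The rewritten corpus gives 'Anna' and 'book' additional co-occurrence
# contexts without enlarging the corpus itself.
```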
Automatic assessment of sentiment in large text corpora is an important goal in the social sciences. This paper describes a methodology and the results of the development of a system for Russian-language sentiment analysis that includes: a publicly available sentiment lexicon, a publicly available test collection with sentiment markup, and a crowdsourcing website for such markup. The lexicon is aimed at detecting sentiment in user-generated content (blogs, social media) related to social and political issues. Its prototype was formed based on other dictionaries and on topic modeling performed on a large collection of blog posts. Topic modeling revealed relevant (social and political) topics and, as a result, relevant words for the lexicon prototype and relevant texts for the training collection. Each word was assessed by at least three volunteers in the context of three different texts where the word occurred, while the texts received their sentiment scores from the same volunteers as well. Both texts and words were scored from −2 (negative) to +2 (positive). Of 7,546 candidate words, 2,753 got non-neutral sentiment scores. The quality of the lexicon was assessed with the SentiStrength software by comparing human text scores with the scores obtained automatically based on the created lexicon. 93% of texts were classified correctly at the error level of ±1 class, which closely matches the result of SentiStrength's initial application to English-language tweets. Negative classes were much larger and better predicted. The lexicon and the text collection are publicly available at http://linis-crowd.org.
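A lexicon-based scorer of the kind evaluated here can be illustrated as follows (a toy sketch in the SentiStrength style: each word carries a score from −2 to +2; the function name and the miniature lexicon are our own, not taken from linis-crowd.org):

```python
# Sketch of lexicon-based sentiment scoring: a text's positive and
# negative strengths are the extremes of its words' lexicon scores.
# Words absent from the lexicon count as neutral (0).

def score_text(words, lexicon):
    """Return (strongest positive, strongest negative) score of a text."""
    scores = [lexicon.get(w, 0) for w in words]
    pos = max([s for s in scores if s > 0], default=0)
    neg = min([s for s in scores if s < 0], default=0)
    return pos, neg

lexicon = {"good": 2, "terrible": -2, "problem": -1}
print(score_text("a good day despite the problem".split(), lexicon))  # (2, -1)
```

Comparing such automatic scores with the volunteers' text scores, at a tolerance of ±1 class, is how the 93% agreement figure cited above was obtained.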
In this paper, we describe the rules and results of the FactRuEval information extraction competition held in 2016 as part of the Dialogue Evaluation initiative in the run-up to Dialogue 2016. The systems were to extract information from Russian texts and competed in two named entity extraction tracks and one fact extraction track. The paper describes the tasks set before the participants and presents the scores achieved by the contending systems. Additionally, we dwell upon the scoring methods employed for evaluating the results of all three tracks and provide some preliminary analysis of the state of the art in information extraction for Russian texts. We also provide a detailed description of the composition and general organization of the annotated corpus created for the competition by volunteers using the OpenCorpora.org platform. The corpus is publicly available and is expected to evolve in the future.
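The basic idea behind scoring a named entity track can be sketched with strict span matching (this is only an illustration of precision/recall/F1 over entity triples; the actual FactRuEval scorer is more elaborate, e.g. it handles partial matches):

```python
# Sketch of strict entity-level evaluation: entities are
# (type, start, end) triples; a prediction counts only on exact match.

def prf1(gold, predicted):
    """Return (precision, recall, F1) for two sets of entity triples."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)                      # exact matches
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = {("PER", 0, 2), ("ORG", 10, 14), ("LOC", 20, 21)}
pred = {("PER", 0, 2), ("ORG", 10, 13)}             # ORG span is off by one
print(prf1(gold, pred))  # precision 0.5, recall ~0.33, F1 ~0.4
```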
The work presents an experimental study of the problem of familiar and unfamiliar speaker identification in normal (voiced) speech and in whisper.
The paper presents the initial, preparatory stage of a study of the variation of hard/soft consonants before e in loanwords (ka[f]e). The main goal is to compile a database of relevant words for use in sociolinguistic research. The database is based on the list of word forms containing relevant contexts in users' queries to Yandex. All entries in the database are annotated for parameters that may be important in a variational study of the phenomenon. The article describes how the list was compiled and the principles of its annotation. The latter includes the consonant itself; the position of the consonant relative to the stressed syllable; the type of syllable in which it occurs (open/closed); the year of the word's first occurrence in the Russian National Corpus; the language from which it was borrowed; and its frequency. The database may be used to select stimuli for experimental studies of variation in modern speech and of its social correlates (age, gender, education, etc.).
The paper reports results of research aimed at finding out whether regressive and/or progressive voice coarticulation in clusters of homorganic labiodental consonants /v/ + /v/ in external sandhi position in Modern Standard Russian may serve as a cue for detecting the location and depth of prosodic breaks. Combinations of labiodental fricatives /v/ + /v/ at word junctures are pronounced as [ff], [vv] or [fv] (in decreasing order of frequency) in Modern Standard Russian. The proportion of these pronunciation types depends on the strength of the prosodic break between the two words:
• within an intonation group (no prosodic break), [ff] pronunciation is fairly stable and accounts for about 70% of all cases, while [fv] pronunciation (corresponding to the absence of coarticulation) varies in the range of 1%–11%;
• at a prosodic break between two words, [fv] pronunciation was detected in more than 80% of the cases studied.
The paper presents the most frequent words of everyday spoken Russian that form the upper zones of several word frequency lists compiled on the material of the Russian speech corpus "One Speaker's Day" (the ORD corpus), containing real-life recordings of everyday communication. All speech data in the corpus is annotated in terms of communication settings, including 1) type of communication (spoken language style), 2) social role of the speaker, 3) locus, etc. Such information allows speech to be filtered upon user request and therefore makes it possible to study speech variation depending on particular communication settings. The present study is based on the transcripts of 152 real-life macro-episodes and covers 232,370 words. The sample represents the speech of 209 persons (95 men, 94 women, 20 children). The following word frequency lists have been compiled: a) a general frequency list, b) a male frequency list, c) a female frequency list, and d) four frequency lists for different styles of spoken communication: informal conversations, professional/business conversations, educational communication, and "customer-service" communication. Men's and women's frequency lists have been compiled on subsamples of 83,371 and 115,110 words, respectively. The analysis of the word lists has shown that Russian women pay more attention to maintaining the conversation, use fewer hesitations, and are more inclined to use intensifiers, emotional words, hedges, and interjections. Men generally use fewer personal pronouns, while numerals and expletives are among the most frequent words used by men in everyday conversations. In general, these observations are similar to those described earlier for gender variation by other linguists.
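Compiling per-group frequency lists from transcripts annotated with communication settings can be sketched as follows (an illustration only; the record fields and toy data are our own, not the ORD annotation scheme):

```python
# Sketch: build word frequency lists from annotated transcripts,
# filtering by metadata such as speaker gender or communication setting.
from collections import Counter

# Toy episode records; real ORD episodes carry richer annotation.
episodes = [
    {"gender": "f", "setting": "informal", "words": ["ну", "вот", "да", "да"]},
    {"gender": "m", "setting": "business", "words": ["ну", "значит", "да"]},
]

def frequency_list(episodes, **filters):
    """Count words over episodes matching all given metadata filters."""
    counts = Counter()
    for ep in episodes:
        if all(ep.get(k) == v for k, v in filters.items()):
            counts.update(ep["words"])
    return counts.most_common()

print(frequency_list(episodes, gender="f"))  # [('да', 2), ('ну', 1), ('вот', 1)]
```

Applying the same function with different filters (gender, setting) yields the separate male/female and per-style lists described above.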
In this paper we show that using deep textual parsing, i.e., extracting complex features such as syntactic and discourse structures of the text, helps to improve the quality of style and genre classification. These results challenge the conclusions of many researchers who have repeatedly stated that using syntactic or morphological patterns for style and genre classification results in poor precision and recall. The best practice so far has been to use n-gram patterns for this type of text classification problem. Syntactic and discourse structures, however, allow capturing style- and genre-specific patterns of texts and reaching average precision higher than 95% on binary multi-genre classification.
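The n-gram baseline mentioned above can be sketched in its simplest form (feature extraction only; any downstream classifier could consume these features, and the example tokens are our own):

```python
# Sketch of word n-gram feature extraction, the conventional baseline
# for style/genre classification that deep parsing features are
# compared against.

def ngrams(tokens, n):
    """Return the list of contiguous word n-grams of a token sequence."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the court finds the claim justified".split()
print(ngrams(tokens, 2))
# ['the court', 'court finds', 'finds the', 'the claim', 'claim justified']
```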
The paper presents the rationale for the decisions taken in the set-up and further development of a learner corpus of student texts written in English by Russian learners, the only openly accessible Russian learner corpus. Manual expert annotation is the focus of the present observations: after introducing the categorization of errors applied in annotation, we examine the complicated cases that arose in annotation practice, followed by a comparison of annotation statistics over the three stages of corpus development. For that purpose, texts annotated by different groups of participants in two experiments were used to spot the problematic areas in annotation. The main pedagogical applications of the learner corpus in teaching EFL (the opportunities to create automated training exercises as well as placement and progress tests custom-made for specific groups of students) are outlined in the concluding part of the paper.
This paper describes the extraction of multiword expressions (MWEs) from corpora for inclusion in a large online lexical resource for Russian. The novelty of the proposed approach is twofold: 1) we use two corpora (the Russian National Corpus and Russian Wikipedia) in parallel and 2) employ an extended set of features based on both data sources. To combine syntactic and statistical features derived from the two corpora, we experiment with several learning-to-rank (LETOR) methods that have proven to be highly effective in information retrieval (IR) scenarios. We use bigrams from existing dictionaries for learning, which keeps the manual annotation effort very low. Evaluation shows that machine-learned rankings with rich features significantly outperform traditional corpus-based association measures and their combinations. Analysis of the resulting lists supports the claim that multiple features and diverse data sources improve the quality of extracted MWEs. The proposed method is language-independent.
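One of the traditional corpus-based association measures that the learned rankings are compared against is pointwise mutual information (PMI); a minimal sketch with toy counts follows (in the paper the counts would come from the Russian National Corpus and Wikipedia):

```python
# Sketch of PMI, a classic association measure for ranking candidate
# bigram MWEs: log2 of the observed bigram probability over the
# probability expected if the two words were independent.
import math

def pmi(bigram_count, w1_count, w2_count, total_bigrams):
    """PMI of a bigram given corpus counts (all counts assumed > 0)."""
    p_xy = bigram_count / total_bigrams
    p_x = w1_count / total_bigrams
    p_y = w2_count / total_bigrams
    return math.log2(p_xy / (p_x * p_y))

# Toy counts: the bigram occurs 50 times, its words 200 and 60 times,
# in a corpus of 1,000,000 bigrams.
print(round(pmi(50, 200, 60, 1_000_000), 2))  # → 12.02
```

A LETOR model, by contrast, treats such measures as individual features alongside syntactic ones and learns their weighting from dictionary bigrams.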