Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference “Dialogue” (Moscow, May 31 to June 3, 2017). Issue 16 (23): in 2 volumes.
The 16th issue of the annual report “Computational Linguistics and Intellectual Technologies” contains the selected materials of the 23rd international conference “Dialogue”. The presented works reflect the areas of research in computational modelling and analysis of natural language that are traditionally represented at the conference.
Tolerance is a complex and partly contradictory concept that can be understood differently not only in different cultures, but also within the same culture. This paper presents a comparative study of how Russian and English speakers perceive tolerance, based on an analysis of corpus data. At the initial stage of the study, the authors semi-automatically compiled a pilot web-based corpus of texts about tolerance. The corpus consists of a Russian-language subcorpus of 199,607 words and an English-language subcorpus of 210,898 words. After the mini-corpus was analyzed, the results were verified against data from the general corpora ruTenTen11 and enTenTen13 using the Sketch Engine platform. The authors compared the word sketches for толерантность (tolerantnost’), tolerance, толерантный (tolerantnyi) and tolerant. In particular, this involved analyzing various lexical-semantic fields and thematic groups of collocates, as well as the following patterns: X толерантности (tolerantnosti) and X of tolerance, толерантность к (tolerantnost’ k) X and tolerance towards X, толерантность и/или (tolerantnost’ i/ili) X and tolerance and/or X. In addition, various derivatives of толерантность (tolerantnost’) / tolerance were discovered in the corpora and analyzed, including numerous nonce words. The corpus analysis provided a deep insight into the way tolerance is perceived by Russian and English speakers.
In this paper we apply network analysis to the study of literature. At the first stage of our investigation we automatically extract networks (graphs) of characters for each part of Leo Tolstoy’s novel War and Peace using two different techniques for network creation. Then we evaluate these two techniques against a set of manually created gold standard networks. Finally, we use the method that performed better in our evaluation to test a literary hypothesis about Tolstoy’s novel. The hypothesis we intended to test was that the parts of the novel describing war (i.e. those where the battlefield or military units are the primary settings) have a statistically lower density of interaction between characters, resulting in lower network density, larger network diameters and lower average node degrees. By showing this correlation we aim to demonstrate the applicability of network analysis to computational research on fictional narrative (e.g. detection of tension changes in the plot).
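The three graph measures named in the abstract above (network density, diameter and average node degree) can be illustrated with a minimal Python sketch. It is not the authors' extraction or evaluation code: the two edge lists are invented toy data with placeholder node labels, and the function names are hypothetical. It only shows how a sparsely connected network differs from a densely connected one on these measures.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency map from (a, b) pairs, skipping self-loops."""
    graph = {}
    for a, b in edges:
        if a == b:
            continue
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def density(graph):
    """Share of possible undirected edges that are present: 2m / (n(n-1))."""
    n = len(graph)
    if n < 2:
        return 0.0
    m = sum(len(neighbours) for neighbours in graph.values()) / 2
    return 2 * m / (n * (n - 1))


def average_degree(graph):
    """Mean number of neighbours per node."""
    if not graph:
        return 0.0
    return sum(len(neighbours) for neighbours in graph.values()) / len(graph)


def diameter(graph):
    """Longest shortest path, found by breadth-first search from every node."""
    longest = 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbour in graph[node]:
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + 1
                    queue.append(neighbour)
        if len(dist) != len(graph):
            raise ValueError("diameter is undefined for a disconnected graph")
        longest = max(longest, max(dist.values()))
    return longest


# Toy data (assumption): a chain of four characters vs. a fully connected group.
sparse_part = build_graph([("A", "B"), ("B", "C"), ("C", "D")])
dense_part = build_graph([("A", "B"), ("A", "C"), ("A", "D"),
                          ("B", "C"), ("B", "D"), ("C", "D")])

for name, g in (("sparse", sparse_part), ("dense", dense_part)):
    print(name, density(g), diameter(g), average_degree(g))
```

In this toy case the chain has lower density, a larger diameter and a lower average degree than the fully connected group, which is the direction of difference the hypothesis predicts for the war parts of the novel.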
The paper reports some results of research aimed at finding out whether place coarticulation is available in clusters of [labial or dental nasal + labiodental obstruent] within a phonological word and in an external sandhi position in Modern Standard Russian, and whether it may serve as a cue for detecting the presence of prosodic breaks and the order of phonological rules. The results show that, before a labiodental obstruent, the F2 value of the nasal is significantly higher for the bilabial nasal and significantly lower for the coronal nasal than the corresponding F2 values before homorganic stops. This type of place coarticulation is found only within a phonological word and is not available in an external sandhi position; thus the absence of this type of coarticulation may serve as a cue for detecting the presence of a prosodic break. In clusters with a final palatalized labiodental obstruent, the F2 value of the bilabial nasal is significantly higher than that of the coronal nasal, since palatalization coarticulation exists in Modern Standard Russian for bilabials but not for coronals before labiodentals. Thus, we argue that the phonological rule of palatalization operates before the rule of place assimilation in Standard Russian.
The presented research was carried out on the material of the ORD speech corpus within a project dedicated to studying sociolinguistic variation in Russian speech and aimed at identifying diagnostic features that characterize the everyday speech of major social groups (by age, gender, status, profession, etc.). The results showed that at practically every linguistic level one can observe features that are highly similar across sociolects. In particular, the similarity is observed in the distribution of phonemes, the distribution of parts of speech, and the frequency of some syntactic structures. The distribution of phonemes was determined on a subcorpus of 172,000 allophones. The following ten phonemes are the most frequent in the speech of all social groups: /a/ (18.18%), /i/ (9.04%), /t/ (6.36%), /o/ (5.43%), /u/ (4.49%), /n/ (4.11%), /j/ (3.82%), /e/ (3.57%), /k/ (3.35%), /s/ (3.01%). The distribution of parts of speech in everyday speech was obtained on a linguistically annotated subcorpus of 125,437 tokens and has the following breakdown: V (17.43%), S (15.29%), S-PRO (14.13%), PART (13.35%), CONJ (9.47%), PR (7.09%), ADV-PRO (5.30%), ADV (4.51%), A-PRO (4.30%), A (3.73%), PRAEDIC (1.84%), INTJ (1.41%), NUM (1.29%), PARENTH (0.56%), ANUM (0.27%), PRAEDIC-PRO (0.01%). At the syntactic level, one-element structures prevail in the everyday speech of all social groups, the most frequent among them being D (particle / discursive word) (3.73%), S (2.26%), and V (1.88%). Statistical analysis of the left-branching and right-branching verb groups has shown that the former significantly prevail in the speech of all sociolects. The revealed features reflect some constant, universal properties of everyday spoken Russian and can be used to adjust and improve speech synthesis and recognition systems.
This paper describes experiments on humorous response generation for short text conversations. First, we compiled a collection of 63,000 jokes from online social networks (VK and Twitter). Second, we implemented several context-aware joke retrieval models: BM25 as a baseline, query term reweighting, a word2vec-based model, and a learning-to-rank approach with multiple features. Finally, we evaluated these models in two ways: on the community question answering platform Otvety@Mail.ru and in laboratory settings. The evaluation shows that an information retrieval approach to humorous response generation yields satisfactory performance.
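The BM25 baseline mentioned in the abstract above can be illustrated with a minimal sketch of Okapi BM25 ranking. It is not the authors' system: the three "jokes" and the query are invented toy strings, the tokenizer is a plain lowercase whitespace split, the parameter values k1 = 1.5 and b = 0.75 are common defaults assumed here, and the class name is hypothetical. The IDF uses the non-negative log(1 + ...) variant.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive tokenizer (assumption): lowercase and split on whitespace."""
    return text.lower().split()


class BM25Index:
    """Minimal Okapi BM25 over a small, non-empty in-memory collection."""

    def __init__(self, documents, k1=1.5, b=0.75):
        self.k1 = k1
        self.b = b
        self.term_counts = [Counter(tokenize(doc)) for doc in documents]
        self.lengths = [sum(counts.values()) for counts in self.term_counts]
        self.avg_length = sum(self.lengths) / len(self.lengths)
        # Document frequency: number of documents containing each term.
        self.doc_freq = Counter()
        for counts in self.term_counts:
            self.doc_freq.update(counts.keys())

    def idf(self, term):
        n = len(self.term_counts)
        df = self.doc_freq.get(term, 0)
        return math.log(1 + (n - df + 0.5) / (df + 0.5))

    def score(self, query, index):
        counts = self.term_counts[index]
        length_norm = 1 - self.b + self.b * self.lengths[index] / self.avg_length
        total = 0.0
        for term in tokenize(query):
            tf = counts.get(term, 0)
            if tf:
                total += self.idf(term) * tf * (self.k1 + 1) / (tf + self.k1 * length_norm)
        return total

    def rank(self, query):
        """Return document indices, best match first (ties broken by index)."""
        scored = [(self.score(query, i), i) for i in range(len(self.term_counts))]
        return [i for _, i in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]


# Toy collection (assumption): stand-ins for the joke corpus.
jokes = [
    "my cat thinks the keyboard is a bed",
    "the weather forecast is a work of fiction",
    "coffee is my favourite debugging tool",
]
index = BM25Index(jokes)
best = index.rank("why does my cat sit on the keyboard")[0]
print(jokes[best])
```

The joke that shares the most informative terms with the user's message is returned as the response candidate; the other models listed in the abstract would replace or extend the scoring function.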
Our experiment is aimed at evaluating the performance of distributional semantic features in metaphor identification in Russian raw text. We apply two types of distributional features representing similarity between the metaphoric/literal verb and its syntactic or linear context. Our approach is evaluated on a dataset of contexts for nine Russian verbs, which is made available to the community. The results show that both sets of similarity features are useful for metaphor identification and do not replicate each other, as their combination systematically improves the performance of individual verb sense classification, reaching state-of-the-art results for verbal metaphor identification. A combined verb classification demonstrates that the suggested features effectively generalize over metaphoric usage in different verbs, and shows that linear coherence features perform as well as the combined feature approach. By analyzing the errors we conclude that syntactic parsing quality is still modest for raw-text metaphor identification in Russian, and discuss the properties of semantic models required for high performance.
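One way to picture a similarity feature between a verb and its linear context is the sketch below. It is an illustration under stated assumptions, not the feature set used in the paper: the three-dimensional vectors are invented toy values standing in for a real distributional model, the English tokens are placeholders, the window size of 2 is arbitrary, and the function names are hypothetical. The feature is the mean cosine similarity between the verb vector and the vectors of the neighbouring tokens.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def linear_context_similarity(tokens, verb_index, vectors, window=2):
    """Mean cosine similarity between the verb and tokens within the window."""
    verb_vector = vectors.get(tokens[verb_index])
    if verb_vector is None:
        return 0.0
    start = max(0, verb_index - window)
    end = min(len(tokens), verb_index + window + 1)
    similarities = [
        cosine(verb_vector, vectors[tokens[i]])
        for i in range(start, end)
        if i != verb_index and tokens[i] in vectors
    ]
    return sum(similarities) / len(similarities) if similarities else 0.0


# Toy vectors (assumption): hand-made values, not taken from any trained model.
toy_vectors = {
    "attack": [1.0, 0.0, 0.0],
    "army": [0.9, 0.1, 0.0],
    "city": [0.8, 0.2, 0.0],
    "idea": [0.1, 0.9, 0.0],
    "argument": [0.0, 1.0, 0.0],
}

literal_score = linear_context_similarity(["army", "attack", "city"], 1, toy_vectors)
metaphoric_score = linear_context_similarity(["idea", "attack", "argument"], 1, toy_vectors)
print(literal_score, metaphoric_score)
```

With these toy values the verb is closer to its neighbours in the first context than in the second. A score of this kind would serve as one input feature to a classifier, and a syntactic variant would take the verb's parsed dependents instead of a fixed window.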