Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference “Dialogue” (Bekasovo, June 4-8, 2014)
Analyzing several Russian nouns that denote everyday objects, we explain why a word sense frequency dictionary is necessary. We propose techniques for calculating approximate sense frequencies, based on native speaker surveys and on annotation of the most frequent collocations in a large text corpus (we used the very large RuTenTen11 corpus integrated into the Sketch Engine system). A word sense frequency dictionary could be used in a variety of NLP tasks, in particular for probabilistic word sense disambiguation when no context is available, in creating second language learning resources, and in academic lexicography. In addition, studies of the sense sets of polysemous words and their relative frequencies are important for linguistic theory, because they shed light on the evolution of the lexical system.
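The collocation-based estimate mentioned in this abstract can be illustrated with a minimal sketch. It assumes that each frequent collocation of a target noun has been manually labelled with one sense and that its corpus count is known; the sense share is then the count-weighted proportion of collocations carrying that sense. The function names, the data layout, and the sample figures are hypothetical and are not taken from the paper, which may weight or combine the evidence differently.

```python
from collections import defaultdict


def estimate_sense_frequencies(annotated_collocations):
    """Estimate relative sense frequencies from annotated collocations.

    annotated_collocations: iterable of (collocation, corpus_count, sense_label).
    Returns a dict mapping each sense label to its share of the total count.
    This is an illustrative assumption, not the authors' published procedure.
    """
    totals = defaultdict(int)
    for _collocation, count, sense in annotated_collocations:
        totals[sense] += count
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {sense: count / grand_total for sense, count in totals.items()}


def most_frequent_sense(frequencies):
    """Context-free disambiguation baseline: pick the most frequent sense."""
    return max(frequencies, key=frequencies.get) if frequencies else None


# Hypothetical annotation of three collocations of one polysemous noun.
sample = [
    ("collocation_a", 600, "sense_1"),
    ("collocation_b", 300, "sense_2"),
    ("collocation_c", 100, "sense_1"),
]
freqs = estimate_sense_frequencies(sample)
best = most_frequent_sense(freqs)
```

With these made-up counts, `sense_1` receives 700 of 1000 occurrences, so a disambiguator that has no context would choose it by default.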
The paper discusses valency realizations of Russian predicate nouns in certain types of syntactic constructions (mainly existential ones such as Mne net neobxodimosti sdavat' ekzamen ‘There is no need for me to take the exam’; lit. ‘to me there is no necessity...’) where these realizations are not directly linked with the nouns concerned. In such cases, the subcategorization frames of the nouns are insufficient to ensure the correct semantic interpretation of the construction in text analysis, or the adequate choice of valency realization in text generation. For every word, the dictionary should supply detailed information on how its valencies are realized within particular constructions.
The article considers the most important and interesting linguistic projects led by Ilya Segalovich (1964–2013), one of the founders of the Yandex search engine, who also took part in their development. These projects include: morphological analysis and synthesis of Russian words, with the ability to process "new" words not included in the dictionary; resolution of morphological ambiguity in Russian by means of normalizing substitutions; practical transcription of foreign, individual and common words; automatic stress placement and the analysis of poetic texts; efficient methods for detecting fuzzy duplicates among textual documents; and development of the information retrieval system «The National Corpus of Russian», among others. The article describes the key ideas and approaches used to solve complicated linguistic problems and outlines Ilya's role in devising these approaches and developing them further. Examples of non-trivial linguistic algorithms developed by Ilya in collaboration with his colleagues are given.
I consider constructions that involve the modal verb moch' or the modal adjective dolzhen and the subjunctive particle by. I argue that, with respect to the subjunctive, these modals behave differently from regular verbs. Their subjunctive is often functionally identical to the indicative; in contexts where other verbs obligatorily take the subjunctive form, these two predicates may use the indicative. The main factor that controls the omissibility of the subjunctive particle is shown to be an epistemic interpretation. I consider some typical cases where the subjunctive and the indicative are synonymous for these predicates, and some where they are not. For instance, in the apodosis of conditional constructions the particle is often omitted, although, in general, Russian prefers a symmetrical use of the subjunctive in both protasis and apodosis. In the protasis, on the other hand, the particle is not omitted. The subjunctive is often used with the modals for pragmatic purposes, such as politeness. The paper is based on data from the Russian National Corpus.
The paper describes a corpus of dialectal Russian speech that is currently under development. The corpus relies on interviews conducted by a joint Swiss-Russian team in the summer of 2013 in a small cluster of North Russian villages, with the goal of studying the local dialect from a sociolinguistic and dialectological perspective. The interviews are transcribed in standard Russian orthography and thus do not involve a detailed phonetic representation. The text is then lemmatized and grammatically annotated with standard tools and loaded into a corpus. The corpus can be queried via a web-based interface that gives the user access to the original sound recordings at the level of individual utterances. This design, the paper argues, allows rapid development of the corpus without a major loss in usability, since the audio data are readily available. Future plans include more field trips as well as a more convenient interface that, among other features, lets users correct the transcription.
The paper reports on the recent forum RU-EVAL, a new initiative for evaluating Russian NLP resources, methods and toolkits. The first two events were devoted to morphological and syntactic parsing, respectively; the third is devoted to anaphora and coreference resolution. Seven participating IT companies and academic institutions submitted results for the anaphora resolution task, and three of them also presented results for the coreference resolution task. The event was organized to assess the state of the art for this NLP task in Russian and to compare the various methods and principles implemented for Russian. We specify the anaphora and coreference tasks, describe the phenomena taken into consideration, and discuss the evaluation procedure. We also give a brief overview of similar evaluation events on whose experience we draw. Finally, we formulate guidelines for constructing the training and Gold Standard corpora and present the measures used in the evaluation.
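Scoring a system against a Gold Standard can be illustrated with a minimal sketch. The abstract does not name the measures used, so this example assumes the common choice of precision, recall and F1 computed over exact matches of (anaphor, antecedent) pairs. The function name, the pair representation and the sample data are hypothetical, and the actual RU-EVAL scoring rules may differ, for example by allowing partial matches.

```python
def precision_recall_f1(gold_pairs, system_pairs):
    """Score system output against gold annotation by exact pair match.

    Both arguments are iterables of hashable (anaphor, antecedent) pairs.
    Exact matching is an assumption made for this illustration only.
    """
    gold = set(gold_pairs)
    system = set(system_pairs)
    correct = len(gold & system)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold annotation and system output.
gold = {("he", "Ivan"), ("she", "Maria"), ("it", "book"), ("they", "students")}
system = {("he", "Ivan"), ("she", "Anna"), ("it", "book")}
p, r, f = precision_recall_f1(gold, system)
```

In this made-up example the system proposes three pairs, two of which are in the gold set of four, so precision is 2/3, recall is 1/2 and F1 is 4/7.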