The paper describes the noun phrase and anaphora annotation in OpenCorpora and compares it to that in other corpora. We discuss the choice of representative texts for anaphoric annotation and the basic principles of syntactic annotation. For noun phrase annotation we followed the scheme introduced earlier for morphological annotation: it was carried out in two stages: first, all noun phrases and some other syntactic units were annotated by a heterogeneous group of people; then a linguist compared all markup results and selected the best one, or corrected mistakes. We present some annotation results and cases of annotators' disagreement and proceed to introduce our data-driven anaphora resolution system based on decision trees. We then list the features used to fit the classifier and discuss their relevance and some changes which improved the classifier's performance. We also present our rule-based approach to automated noun phrase extraction using the Tomita parser. A baseline for anaphora resolution is introduced and compared with our results.
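The decision-tree approach can be illustrated with a hand-written tree over two typical anaphora features. The feature names, thresholds, and data below are hypothetical and do not reproduce the system's actual model:

```python
# A minimal sketch of a decision-tree-style classifier for anaphora
# resolution. The features (gender/number agreement, token distance)
# and thresholds are invented for illustration, not the paper's model.

def is_antecedent(candidate, pronoun):
    """Return True if `candidate` is judged a plausible antecedent."""
    # Agreement features: a mismatch almost always rules a candidate out.
    if candidate["gender"] != pronoun["gender"]:
        return False
    if candidate["number"] != pronoun["number"]:
        return False
    # Distance feature: prefer antecedents within a small token window.
    distance = pronoun["position"] - candidate["position"]
    return 0 < distance <= 30

candidate = {"gender": "f", "number": "sg", "position": 4}
pronoun = {"gender": "f", "number": "sg", "position": 12}
print(is_antecedent(candidate, pronoun))  # True: agrees and is close
```

A trained tree learns such splits and thresholds from annotated antecedent-anaphor pairs instead of hard-coding them.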
The objective of this paper is to determine what semantic components in the meaning of a word facilitate its lexicalization as prosodically marked and aid its focalization in an utterance. The paper demonstrates that prosodic and communicative properties of a word correlate with its semantic properties. In particular, a case study of different senses of the words tol’ko ‘only’, pravda ‘true’, eshche ‘still, more’, voobshche ‘in principle, generally’, po krajnej mere ‘at least’ and some others reveals that focalization and prosodic marking in a word are triggered by the semantics of contrast, high degree, and addition. On the other hand, semantics of concession in the meaning of a word limits its ability for accentual marking and focalization. The observed correlations between semantics and prosody are confirmed by the multimedia corpus data.
Although there exist comprehensive morphologically annotated corpora for many morphologically rich languages, there have been no such corpora for any polysynthetic language so far. Developing a corpus of a polysynthetic language poses a range of theoretical and practical challenges for corpus linguistics. Some of these challenges have been partly addressed when developing corpora for languages with extensive morphological inventories and numerous productive derivation models, such as the Turkic or Uralic languages, while others are unique to this type of language. As we are currently working on a corpus of the polysynthetic West Circassian language, we had to identify these challenges and propose theoretical and practical solutions. These include the tokenization problem, which involves delimiting morphology from syntax, the problems of lemmatization and part-of-speech tagging, and a number of glossing and search issues. The solutions proposed in the paper are partly implemented and will be available for public testing when the preliminary version of the corpus is released.
The paper discusses evaluation techniques for semantic role labeling in Russian. It has been shown that the quality of FrameNet-style semantic role labeling largely depends on the number of roles and may decrease if the inventory of roles in the training set differs from that in the output resource. Our study is the first step towards a ‘smart’ evaluation tool which would introduce linguistically relevant criteria to evaluation; be able to place mistakes on a scale from minor to critical; and make evaluation easier in cases where the grid of roles varies.
We run an experiment based on the data from the Russian FrameBank, a FrameNet-oriented open-access database which includes a dictionary of Russian lexical constructions and a corpus of tagged examples. The semantic role is one of the parameters that define the predicate-argument patterns in FrameBank. The inventory of roles is modeled hierarchically and forms a graph. We explore the cases when the role induced by the system and the answer of the gold standard do not match. We analyze the statistical criteria of the distribution of roles in the patterns and the distance between the source and the target in the graph of roles as a means of assessing the goodness of fit.
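The graph distance between a predicted and a gold role can be computed as a shortest path in the role graph. A minimal sketch, with a toy inventory whose roles and edges are invented for illustration and do not reproduce FrameBank's actual hierarchy:

```python
from collections import deque

# Toy role graph: each role maps to its neighbours. Invented for
# illustration; FrameBank's real inventory is much larger.
ROLE_GRAPH = {
    "agent":       ["causer"],
    "causer":      ["agent", "stimulus"],
    "stimulus":    ["causer"],
    "experiencer": ["patient"],
    "patient":     ["experiencer", "theme"],
    "theme":       ["patient"],
}

def role_distance(source, target):
    """Shortest-path length between two roles; None if unreachable."""
    queue = deque([(source, 0)])
    seen = {source}
    while queue:
        role, dist = queue.popleft()
        if role == target:
            return dist
        for neighbour in ROLE_GRAPH.get(role, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None  # roles lie in disconnected parts of the graph

print(role_distance("agent", "stimulus"))  # 2: agent -> causer -> stimulus
print(role_distance("agent", "patient"))   # None: disconnected subgraphs
```

Under such a measure, confusing a role with a close neighbour (distance 1-2) can be scored as a minor mistake, while a distant or unreachable role counts as a critical one.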
The paper presents clustering experiments on Russian verbs based on the statistical data drawn from the Russian FrameBank (framebank.ru). While lexicology has essentially abandoned the idea of syntactic transformations as the primary basis for grouping verbs into semantic classes (Apresjan 1967, Levin 1993), the hypothesis that lexical clusters share the same lexical and syntactic distributional profiles is still attractive. In computational linguistics, some attempts have been made to obtain verb classes for English, German and other languages using observable morpho-syntactic and lexical properties of context (Dorr and Jones 1996; Lapata 1999; Schulte im Walde 2006; Lenci 2014, among others). Our experiments on the semantic classification of Russian verbs are based on two types of tags embedded in the annotation of argument constructions: a) semantic roles and b) morpho-syntactic patterns. The domain of speech verbs is classified automatically based on vectors, and the resulting clusters are contrasted against Babenko’s (2007) semantic classes and three other manual classifications. The classes within the domain of possessive verbs are constructed using rule-based solutions and evaluated against Berkeley FrameNet verb clusters. We conclude that clustering on morpho-syntactic (purely formal) patterns loses the race to more intelligent approaches which take semantic roles into account.
This paper presents a rule-based approach to the Information Extraction (IE) task within the FactRuEval-2016 competition. Our system is based on ABBYY Compreno technology. The technology uses the results of deep syntactic-semantic analysis, which significantly reduces the number of necessary rules and makes them concise. The evaluation was conducted on the FactRuEval dataset. FactRuEval is an open evaluation of IE systems. The participants could take part in three tracks. The first track required participants to detect the boundaries and types of named entities in a text. The second track required them to extract normalized attributes and perform local identification of named entities. The third track required them to extract facts of certain types from a text. We took part in all three tracks under the nickname violet. Our method proved successful: we achieved high F-measures in the Named Entity Recognition tracks and the highest F-measure in the Fact Extraction track.
Russian lexical stress exhibits both inter-speaker variation, determined by the speaker’s regional affiliation, social status, age, etc., and intra-speaker variation. The latter is difficult to capture due to the need for large corpora of spoken text produced by a single speaker. Such corpora are lacking, but can be replaced with poetic corpora. We use automatic analysis of poetic texts by 10 poets, drawn from the Russian National Corpus, in order to find word forms that can have stress variation. The number of such forms for an individual speaker ranges between 30 and 200 words, distributed among different parts of speech. We propose a quantitative measure of overall stress variability independent of corpus size and show that there is a tendency for this variability to diminish over time, at least in poetic texts.
The paper is devoted to the problem of modeling general-language frequency using data from large Russian corpora. Our goal is to develop a methodology for compiling a consolidated frequency list which in the future can be used for assessing the lexical complexity of Russian texts. We compared 4 frequency lists derived from 4 corpora (Russian National Corpus, ruTenTen11, Araneum Russicum III Maximum, Taiga). Firstly, we applied rank correlation analysis. Secondly, we used the measures of “coverage” and “enrichment”. Thirdly, we applied the “sum of minimal frequencies” measure. We found significant differences between the compared frequency lists both in ranking and in relative frequencies. The application of the “coverage” measure showed that the frequency lists are by no means interchangeable. Therefore, none of the corpora in question can be excluded when compiling a consolidated frequency list. For a more detailed comparison of the frequency lists across frequency bands, the ranked frequency list based on RNC data was divided into 4 equal parts, and 4 random samples (20 lemmas from each quartile) were formed. Due to the wide range of values taken by the ipm measure, relative frequency values are difficult to interpret. In addition, there are no reliable thresholds separating high-frequency, mid-frequency, and low-frequency lemmas. Meanwhile, to assess the lexical complexity of texts, it is useful to have a convenient way of distributing lemmas with certain frequencies over the bands of the frequency list. Therefore, we decided to assign lemmas “Zipf values”, which made the frequency data interpretable because the range of the measure is small. The result of our work will be a publicly accessible reference resource called “Frequentator”, which will allow users to obtain interpretable information about the frequency of Russian words.
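The Zipf scale mentioned above is commonly defined as the base-10 logarithm of a lemma's frequency per billion words, i.e. log10(ipm) + 3, which compresses frequencies into an interpretable range of roughly 1-7. A minimal sketch of the conversion (the lemmas and ipm values below are invented for illustration):

```python
import math

# Convert relative frequency (ipm, instances per million) to the
# Zipf scale: Zipf = log10(ipm) + 3, i.e. log10 of the frequency
# per billion words. 1 ipm maps to Zipf 3; 1000 ipm maps to Zipf 6.
def zipf_value(ipm):
    return math.log10(ipm) + 3

# Hypothetical lemmas and frequencies, for illustration only.
for lemma, ipm in [("i", 30000.0), ("slovo", 300.0), ("redkoslovie", 0.03)]:
    print(f"{lemma}: ipm={ipm}, Zipf={zipf_value(ipm):.1f}")
```

On this scale, a difference of 1 always means "ten times more frequent", so thresholds between low-, mid-, and high-frequency bands become easy to state and compare across corpora of different sizes.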
The presented research was supported by the Russian Science Foundation, project #19-18-00525 “Understanding official Russian: the legal and linguistic issues”.
The paper describes the results of the first shared task on word sense induction (WSI) for the Russian language. While similar shared tasks were conducted in the past for some Romance and Germanic languages, we explore the performance of sense induction and disambiguation methods for a Slavic language that shares many features with other Slavic languages, such as rich morphology and virtually free word order. The participants were asked to group contexts of a given word in accordance with its senses, which were not provided beforehand. For instance, given the word “bank” and a set of contexts for this word, e.g. “bank is a financial institution that accepts deposits” and “river bank is a slope beside a body of water”, a participant was asked to cluster such contexts into a number of clusters, unknown in advance, corresponding in this case to the “company” and the “area” senses of the word “bank”. For the purpose of this evaluation campaign, we developed three new evaluation datasets based on sense inventories with different sense granularity. The contexts in these datasets were sampled from texts of Wikipedia, the academic corpus of Russian, and an explanatory dictionary of Russian. Overall, 18 teams participated in the competition, submitting 383 models. Multiple teams managed to substantially outperform competitive state-of-the-art baselines from previous years based on sense embeddings.
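A common way to score an induced clustering of contexts against a gold sense inventory is the Adjusted Rand Index (ARI); whether it was this campaign's official metric is not stated here, so the following is an illustrative sketch in pure Python, with toy gold labels and predicted cluster ids:

```python
from collections import Counter
from math import comb

# Adjusted Rand Index: 1.0 for a perfect match (up to relabelling),
# around 0 for a random clustering, negative for worse than random.
def adjusted_rand_index(gold, pred):
    n = len(gold)
    contingency = Counter(zip(gold, pred))
    sum_cells = sum(comb(c, 2) for c in contingency.values())
    sum_gold = sum(comb(c, 2) for c in Counter(gold).values())
    sum_pred = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_gold * sum_pred / comb(n, 2)
    max_index = (sum_gold + sum_pred) / 2
    return (sum_cells - expected) / (max_index - expected)

gold = ["finance", "finance", "river", "river"]  # gold senses per context
pred = [0, 0, 1, 1]  # induced cluster ids: perfect up to relabelling
print(adjusted_rand_index(gold, pred))  # 1.0
```

Because ARI compares partitions rather than labels, systems are free to use any number of clusters with arbitrary ids, which matches the "unknown in advance" setting of the task.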
The paper presents a methodology and preliminary results for evaluating plagiarism detection algorithms for the Russian language. We describe the goals and tasks of the PlagEvalRus workshop, dataset creation, evaluation setup, metrics, and results.
In the article, I focus on tense marking in Russian constructions with predicatives, such as xolodno ‘(it is) cold’ and ploxo ‘(it is) bad’. Statistical data from the Russian National Corpus show that the frequency of past tense forms (e.g., combinations with the form bylo) is much greater for some predicatives than for others. This difference results from both semantic and formal factors. On the one hand, some predicatives denote evaluation (e.g. ploxo ‘bad’). Evaluation can be applied to events that have finished or have never been realized. What is relevant is that the evaluation is made at the moment of speech, and this is why the present tense (= the zero copula verb) is used. On the other hand, it is important that the present tense is unmarked with predicatives, while with verbs, it is marked with special verbal affixes. The unmarked present tense form of a predicative can get its temporal meaning from the embedded verb. Interestingly, this phenomenon is in a sense the opposite of the well-known phenomenon of relative tense marking. While the latter presupposes that the tense assignment in the embedded event is anchored to the tense meaning of the main event, the tense value of the construction with evaluation predicatives is assigned by ‘agreement’ with the embedded verb.
The paper addresses parallels between tense, aspect and modality marking in Russian embedded clauses. It is widely known that tense forms of embedded verbs can be interpreted relatively or absolutely, and in some cases, the relative and absolute uses seem to be in free variation. It turns out that the interpretation of modality and aspect can be described along the same lines and classified into relative and absolute uses. For instance, the subjunctive mood—one of the main instruments of irreality marking—can be interpreted as less real than the main event (relative interpretation) or less real than the moment of speech (and to the same degree as the main event; absolute interpretation). Similarly, aspect forms, depending on their interpretation, can describe the structure of the situation relative to the speech act or to the main event. I show that the parallelism between the three categories is not complete: for instance, relative modality is mainly observed in triclausal constructions. Modality interpretation is sensitive to the opposition of clausal adjuncts vs. relative clauses. For aspect interpretation, the contrast between finite forms and the infinitive is relevant: the infinitive allows the relative use of the perfective aspect much more easily than finite forms do. Finally, the interpretations of the three categories are related to each other. For example, in complement clauses, the relative interpretation is perfectly acceptable for all three categories.
The paper presents the design of the Russian-Turkic Bilingual Corpus (RuTuBiC) and its basic identifying features: the aim of producing the corpus, the types of texts it contains, the principles of metatextual markup and error annotation, and the underlying technological (IT, digital) concepts. The current state and development trends of the corpus are discussed. The corpus started as an integral part of a research project intended to explore the dynamics of language and culture interaction in South Siberia; it embraces recordings of the oral speech of Russian-Turkic (Russian-Tatar, Russian-Shor and Russian-Khakass) bilinguals, transcribed and error-annotated. The corpus data make it possible to reveal mother-tongue influence within the system of deviations from the speech standard in bilingual speech by setting these deviations against their various possible sources, as well as to trace the influence of social and linguistic factors on the occurrence of deviations from the speech standard.
This paper presents a system for determining semantic similarity between words that was an entry in the Dialog 2015 Russian semantic similarity competition. The system introduced is primarily based on word vector models, supplemented with various other methods, both corpus- and dictionary-based. In this paper we compare the performance of two methods for building word vectors (word2vec and GloVe), evaluate how performance varies with corpus size and preprocessing techniques, and measure accuracy gains from the supplementary methods. We compare system performance on word relatedness and word association tasks, and it turns out that different methods have varying relative importance for these tasks.
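At the core of such vector-based similarity systems is cosine similarity between word vectors. A minimal sketch, with toy 4-dimensional vectors standing in for real word2vec or GloVe embeddings (the vectors and words are invented for illustration; real embeddings have hundreds of dimensions):

```python
import math

# Cosine similarity: the dot product of two vectors normalized by
# their lengths, ranging from -1 to 1 for real embeddings.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings: similar words get similar vectors.
koshka = [0.9, 0.1, 0.3, 0.0]   # 'koshka' (cat)
kot    = [0.8, 0.2, 0.4, 0.1]   # 'kot' (tomcat)
stol   = [0.0, 0.9, 0.1, 0.8]   # 'stol' (table)
print(cosine(koshka, kot) > cosine(koshka, stol))  # True
```

A word2vec- or GloVe-based system answers a similarity query by ranking candidate words by this score against the query word's vector.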
The argument constructions of adjectives have largely been outside the scope of research on semantic roles in both theoretical and computational fields. Before adding the roles of adjectival arguments to the network of semantic roles, it is important to determine whether the adjectival roles form a separate list or whether they can be seen as an extension of the roles assigned to the patterns of verbs and nominalizations. We discuss the general principles of how the inventory of adjectival roles should be organized in comparison with the existing inventories of verbal roles. In order to verify our statements, we carry out an experimental survey aimed at measuring the similarity between adjectival and verbal roles. The results show that both the semantic interpretation of roles and their typical morpho-syntactic expression are significant for the evaluation and should be taken into account in working out the inventory. Besides, the specificity of adjectives lies in their prototypically stative semantics, which gives rise to some differences in assigning a semantic role as compared to verbs. The results of the survey also provide some evidence for the verification and development of the inventory of verbal semantic roles.
Word sense disambiguation (WSD) methods are useful for many NLP tasks that require semantic interpretation of input. Furthermore, such methods can help estimate word sense frequencies in different corpora, which is important for lexicographic studies and language learning resources. Although previous research on the disambiguation of Russian polysemous verbs established some important and interesting results, it was mostly focused on reducing ambiguity or determining the most frequent sense, not on evaluating WSD accuracy. To the best of our knowledge, there is no comprehensively evaluated method that can perform semi-supervised word sense disambiguation for Russian verbs. In this paper we present a WSD method for verbs that is able to reach an average disambiguation accuracy of 75% using only available linguistic resources: examples and collocations from the Active Dictionary of Russian and large unlabeled corpora. We evaluate the method on contexts sampled from the web-based corpus ruTenTen11 for 10 verbs, with 100 contexts for each verb. We compare different variations of the method and analyze its limitations. The method’s implementation and labeled contexts are available online.
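One simple way to exploit dictionary collocations for disambiguation, in the spirit of (though not identical to) the method described, is to pick the sense whose collocation set overlaps most with the context. The senses and collocations below are invented for illustration:

```python
# Hypothetical sense inventory: each sense maps to a set of
# collocations taken, in a real system, from a dictionary entry.
SENSES = {
    "bank_finance": {"deposit", "loan", "account"},
    "bank_river": {"water", "slope", "shore"},
}

def disambiguate(context_words, senses):
    """Pick the sense with the largest collocation overlap."""
    words = set(context_words)
    return max(senses, key=lambda s: len(words & senses[s]))

ctx = "the bank raised the loan rate on my account".split()
print(disambiguate(ctx, SENSES))  # bank_finance
```

A realistic system would lemmatize the context, weight collocations, and back off to sense representations learned from unlabeled corpora when no collocation matches.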
When words have several senses, it is important to describe them properly in a dictionary (a lexicographic task) and to be able to distinguish them in a given context (a computational linguistics task, WSD). Different senses normally have different frequencies in corpora. We introduced several techniques for determining sense frequency based on dictionary entries matched with data from large corpora. Information about word sense frequency is not only useful for explanatory lexicography and WSD, but may also enrich language learning resources. Learners of a foreign language who encounter a word similar to one in their native language are often tempted to assume that the foreign word and its equivalent have the same meaning structure. Sometimes, however, this is not the case, and the most frequent sense of a word in one language may be much less frequent for its cognate. We proposed a method for detecting such cases. Having selected a set of Russian words included in the Active Dictionary of Russian which have more than two dictionary senses and have cognates in English, we estimated the frequencies of the English and Russian senses using SemCor and the Russian National Corpus respectively, matched the senses in each pair of words and compared their frequencies. Thus we revealed cases in which the most frequent senses and whole meaning structures are, cross-linguistically, substantially different, and studied them in more detail. This technique can be applied not only to cognates, but also to pairs of words which are usually offered by dictionaries as translation equivalents of each other.
The assumption that senses are mutually disjoint and have clear boundaries has been called into doubt by several linguists and psychologists. The problem of word sense granularity is widely discussed both in lexicographic and in NLP studies. We aim to study word senses in the wild, in raw corpora, by performing word sense induction (WSI). WSI is the task of automatically inducing the different senses of a given word, framed as an unsupervised learning task with senses represented as clusters of token instances. In this paper, we compared four WSI techniques: Adaptive Skip-gram (AdaGram), Latent Dirichlet Allocation (LDA), clustering of contexts, and clustering of synonyms. We evaluated them quantitatively and qualitatively and performed an in-depth study of the AdaGram method, comparing AdaGram clusters for 126 words (nouns, adjectives, and verbs) with their senses in published dictionaries. We found that AdaGram is quite good at distinguishing homonyms and metaphoric meanings. It ignores disappearing and obsolete senses, but induces new and domain-specific senses which are sometimes absent from dictionaries. However, it works better for nouns than for verbs, ignoring structural differences (e.g. causative meanings or different government patterns). The AdaGram database is available online: http://adagram.ll-cl.org/.
This study is dedicated to the problem of automatic transliteration of different Yiddish orthographies. Almost every publishing house has its own specific orthographical features, and each orthography can be inconsistent. The team of the Yiddish corpus needs a tool that would standardize the variety of writing systems. There are several types of converters, but they cannot meet all our needs. The converter that we created works in two steps: first, using a complicated rule-based system, it converts any given Yiddish text into the standard orthography; second, it converts the text in standard Yiddish into one in Latin letters. The units engaged in our rule-based system are mostly morphemes, although we also used some other letter combinations that ought to be transliterated in a complicated way. Our solutions led to a transliteration accuracy of 94% on raw text and 98% on text written in more or less standard orthography. We believe the accuracy can be improved by adding a list of words of Semitic origin and by machine learning methods.
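The second step (standard Yiddish to Latin letters) can be sketched as a longest-match rule table. The few mappings below follow YIVO romanization for some unambiguous letters; the real converter operates mostly on morphemes and a far larger, context-sensitive table:

```python
# A handful of YIVO letter mappings; the full system is much larger
# and handles context-dependent and morpheme-level rules.
RULES = {"ש": "sh", "ט": "t", "ע": "e", "ל": "l", "י": "i", "ב": "b"}

def transliterate(text, rules):
    """Replace the longest matching rule key at each position."""
    out = []
    i = 0
    keys = sorted(rules, key=len, reverse=True)  # longest match first
    while i < len(text):
        for k in keys:
            if text.startswith(k, i):
                out.append(rules[k])
                i += len(k)
                break
        else:
            out.append(text[i])  # pass unknown characters through
            i += 1
    return "".join(out)

print(transliterate("ליב", RULES))  # "lib" ('dear')
```

Sorting keys by length lets multi-character keys (e.g. digraphs in the source orthography) take precedence over single letters, which is essential once morpheme-sized units are added to the table.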
The paper deals with the phenomenon of participant activity in a dialogue. The analysis of participants’ activity in a conversation is of great importance for theoretical as well as applied linguistics. In forensic linguistics, activity analysis can be used as an objective parameter for establishing the real communicative goals of the participants. Three main methods for analyzing this phenomenon are introduced. The first is communicative activity, i.e. the number of illocutionarily independent speech acts produced by a participant in a dialogue or its relevant part. The second is thematic activity; its analysis reveals which of the participants independently introduces the main themes of a conversation. The third, quantitative activity, is based on counting the words connected with a specific theme in a conversation. Different types of correlation between the three methods are discussed.
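The third measure, quantitative activity, amounts to counting theme-related words per participant. A minimal sketch, with an invented theme lexicon and dialogue (real analysis would work on lemmatized Russian text with a theme lexicon built for the case at hand):

```python
# Hypothetical theme lexicon and dialogue turns, for illustration only.
THEME_WORDS = {"contract", "payment", "deadline"}

dialogue = [
    ("A", "the contract mentions a payment before the deadline"),
    ("B", "yes and the weather was fine"),
    ("A", "the payment is late"),
]

def quantitative_activity(turns, theme_words):
    """Count theme-related word tokens per speaker."""
    counts = {}
    for speaker, utterance in turns:
        hits = sum(1 for w in utterance.split() if w in theme_words)
        counts[speaker] = counts.get(speaker, 0) + hits
    return counts

print(quantitative_activity(dialogue, THEME_WORDS))  # {'A': 4, 'B': 0}
```

Comparing such counts across participants shows who drives a given theme, which can then be set against the communicative and thematic activity measures.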