Computational Linguistics and Intellectual Technologies: Proceedings of the Annual International Conference “Dialogue” (Moscow, May 30 to June 2, 2018)
This paper discusses a method for detecting statistically significant linguistic differences between corpora while factoring in possible variability within the corpora being compared. Specifically, we compare two small corpora of dialects of Even, Bystraja and Lamunkhin Even, in an attempt to identify morphemes that are more frequent in one corpus than in the other. To investigate whether such a difference might be due to the over-representation of a speaker who happens to be an outlier in the use of a particular morpheme, we use DP, a measure of how evenly a specific linguistic feature is distributed across the subcorpora of a corpus.
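The abstract does not spell out how DP is computed. Assuming it refers to the deviation of proportions, a dispersion measure that compares each subcorpus's share of the corpus with its share of the feature's occurrences, a minimal sketch might look as follows. The function name and the toy numbers are illustrative and are not taken from the paper.

```python
def deviation_of_proportions(part_sizes, feature_counts):
    """Return DP in [0, 1): 0 means the feature is spread exactly in
    proportion to subcorpus size, values near 1 mean it is concentrated
    in a small part of the corpus.

    part_sizes     -- token counts of the subcorpora (e.g. one per speaker)
    feature_counts -- occurrences of the feature in each subcorpus
    """
    if len(part_sizes) != len(feature_counts):
        raise ValueError("one feature count is needed per subcorpus")
    total_size = sum(part_sizes)
    total_count = sum(feature_counts)
    if total_size == 0 or total_count == 0:
        raise ValueError("corpus and feature counts must be non-empty")

    # Expected share: how much of the corpus each part makes up.
    expected = [size / total_size for size in part_sizes]
    # Observed share: how much of the feature each part contains.
    observed = [count / total_count for count in feature_counts]

    # Half the summed absolute differences between the two distributions.
    return 0.5 * sum(abs(o - e) for o, e in zip(observed, expected))


# Hypothetical example: four speakers with equally sized subcorpora.
even_use = deviation_of_proportions([100, 100, 100, 100], [5, 5, 5, 5])
one_outlier = deviation_of_proportions([100, 100, 100, 100], [20, 0, 0, 0])
print(even_use, one_outlier)
```

A high value for a morpheme would suggest that its overall frequency is driven by a few speakers, which is the situation the abstract sets out to detect.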
In this paper we introduce RusDraCor, an open corpus of Russian drama for digital literary and linguistic research. The corpus (rus.dracor.org) contains plays from the mid-18th century to the first third of the 20th century, provided with structural (and some semantic) markup and metadata. Texts are encoded in TEI, the XML-based standard widely used in building corpora for the humanities. We describe the contents and annotation layers of the corpus, provide some details on its development and enrichment, and finally describe three research cases. Each case demonstrates the use of RusDraCor to answer specific questions about the composition, structural features and historical evolution of Russian drama.
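To illustrate what structural markup in TEI makes possible, the sketch below counts speeches per character in a tiny hand-written TEI fragment. The fragment, the character identifiers and the function name are hypothetical; the sketch assumes only the standard TEI convention of `sp` elements carrying a `who` attribute and does not reproduce the actual RusDraCor encoding.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# The TEI namespace; the prefix "tei" is a local choice for the XPath queries.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, heavily simplified play fragment (not from the corpus).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div type="act">
        <div type="scene">
          <sp who="#anna"><speaker>Anna</speaker><p>First line.</p></sp>
          <sp who="#boris"><speaker>Boris</speaker><p>A reply.</p></sp>
          <sp who="#anna"><speaker>Anna</speaker><p>Another line.</p></sp>
        </div>
      </div>
    </body>
  </text>
</TEI>"""


def speeches_per_speaker(tei_xml):
    """Count <sp> elements per character referenced in their who attribute."""
    root = ET.fromstring(tei_xml)
    counts = Counter()
    for speech in root.iterfind(".//tei:sp", TEI_NS):
        # who may hold several space-separated references such as "#a #b".
        for reference in speech.get("who", "").split():
            counts[reference.lstrip("#")] += 1
    return counts


print(speeches_per_speaker(SAMPLE_TEI))
```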
The purpose of the paper is to investigate cues that signal relations between discourse units in Russian. Building a lexicon of discourse connectives is an indispensable subtask in many discourse parsing applications as well as an essential issue in theoretical research on text coherence. In order to develop such a resource for Russian, we have conducted a corpus-based study of discourse connectives that were manually extracted from the Russian Rhetorical Structure Treebank (Ru-RSTreebank). The Treebank includes 79 texts annotated within the RST framework (Mann, Thompson 1988). To provide a deeper analysis of connectives in Russian, we focus on causal relations only, namely the ‘Cause-Effect’ relation. Some connectives (primary connectives) are enumerated in grammars and dictionaries; they primarily mark intra-sentential relations. However, there is an extensive class of less grammaticalized items (secondary connectives) that have received less attention so far. Some of them are based on content words (e.g. по причине ‘for the cause’). Secondary connectives often serve as linking devices for inter-sentential relations. We propose a scheme for annotating connectives in Russian and specify the basic patterns that can be used to mine less grammaticalized connectives from an unannotated corpus. We also compare the two classes of connectives (primary vs. secondary). Our research shows that the two classes differ in their properties: there is a statistically significant difference between them with respect to nucleus/satellite position, intra- vs. inter-sentential relations, and some other features.
The structure of Russian everyday dialogue was studied on the basis of 73 microdialogues of everyday speech communication from the ‘One Day of Speech’ corpus (the ORD Corpus). The aim of the research was to find out what types of speech acts commonly initiate and complete everyday dialogues, and to reveal the most typical sequences of speech acts in these dialogues. In total, 2230 speech acts produced by 30 people in both professional and household conversations were analysed. N-gram analysis was used to calculate the most frequent sequences of speech acts. The results show that dialogues are usually started by representatives, i.e. speech acts related to the exchange of information (38% of all cases); etiquette beginnings (greetings, vocatives) occur in 23% of the dialogues, and in 19% of cases the conversation begins with a regulatory form. Speech acts ending dialogues show greater variety: representatives contribute 2% of all dialogue ends, valuative judgments and regulatory forms cover 14% each, followed by directives (8%), commissives (8%), etiquette forms (8%) and emotional and expressive forms (7%). The most typical bigrams of speech acts are the following: two consecutive representatives (22.35%), a regulatory form followed by a representative (6.93%), a representative followed by a regulatory form (6%), a valuative followed by a representative (5.21%), a representative followed by a valuative judgment (4.77%), and the two combinations of a directive with a representative (2.77% each). The article also presents data on the most frequent pairs of speech acts at the subtype level. Here, the most frequent sequence is ‘question’ + ‘answer’, which accounts for 2.45%.
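The bigram analysis described above can be sketched as follows. This is a minimal illustration under two assumptions that the abstract does not confirm: each dialogue is available as a list of speech-act labels, and bigrams are not counted across dialogue boundaries. The labels and toy dialogues are invented for the example.

```python
from collections import Counter


def bigram_shares(dialogues):
    """Return the share of each speech-act bigram among all bigrams.

    dialogues -- iterable of lists of speech-act labels, one list per dialogue
    """
    counts = Counter()
    for acts in dialogues:
        # Pair each speech act with its successor inside the same dialogue.
        counts.update(zip(acts, acts[1:]))
    total = sum(counts.values())
    return {bigram: n / total for bigram, n in counts.items()}


# Hypothetical toy data, not taken from the ORD Corpus.
toy_dialogues = [
    ["representative", "representative", "regulatory"],
    ["etiquette", "representative"],
]
shares = bigram_shares(toy_dialogues)
for bigram, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(bigram, f"{share:.2%}")
```

The same counter, restricted to the first or last label of each dialogue, would give the distribution of opening and closing speech acts.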
This paper studies the impact corpus size has on the robustness of various frequency-based measures of corpus distance (or similarity, respectively), such as Euclidean distance, Manhattan distance, Cosine distance, χ², Spearman’s ρ, and Simple-Maths Keyword distance. An experiment performed using the British National Corpus shows that Euclidean distance is least influenced by corpus size and thus is best suited for the purpose of comparing corpora.
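Three of the measures named above can be sketched over word-frequency vectors as follows. The abstract does not say how frequencies are normalised or which vocabulary is used, so the relative frequencies over a shared word list here are an assumption, and all function names are illustrative.

```python
import math
from collections import Counter


def relative_frequencies(tokens, vocabulary):
    """Map a token list to relative frequencies over a fixed word list."""
    if not tokens:
        raise ValueError("corpus must contain at least one token")
    counts = Counter(tokens)
    return [counts[word] / len(tokens) for word in vocabulary]


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


def cosine_distance(p, q):
    """One minus the cosine of the angle between the two vectors."""
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0 or norm_q == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    dot = sum(a * b for a, b in zip(p, q))
    return 1 - dot / (norm_p * norm_q)


# Hypothetical toy corpora.
corpus_a = "the cat sat on the mat".split()
corpus_b = "the dog sat".split()
vocabulary = sorted(set(corpus_a) | set(corpus_b))
p = relative_frequencies(corpus_a, vocabulary)
q = relative_frequencies(corpus_b, vocabulary)
print(euclidean(p, q), manhattan(p, q), cosine_distance(p, q))
```

Repeating such a comparison on samples of different sizes drawn from the same corpus is one way to observe how strongly each measure reacts to corpus size.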
A subject index, or back-of-the-book index, is a device intended to provide easy access to relevant fragments of a text document. Subject indexes usually contain selected single-word and multi-word terms from the corresponding documents. Such indexes are especially useful for reading large documents with specialized terminology, as well as educational texts in difficult scientific and technical areas. The central problem of back-of-the-book indexing is the recognition of terms to be included in the index. The paper describes a method for extracting and filtering terms from a given educational scientific text, with the aim of reliable term selection in computer indexing systems. The method is primarily based on rules with lexico-syntactic patterns that represent linguistic information about terms and the typical contexts of their usage in Russian scientific and educational texts; simple term occurrence statistics are used as well. Experimental evaluation of the method has shown a considerable increase in the precision and recall of term extraction compared with widely used standard techniques.
This paper is a first step towards a corpus-based description of the semantics of Russian pronouns in intensional contexts. Having justified the use of corpora in (formal) semantic research, I delineate a particular issue within the topic: whether a given pronoun is interpreted de se or de re in counteridentity contexts.
A counteridentity context is a clause within the scope of a counterfactual (clause or adverbial) that affects the identity of a real individual, e.g. if I were you, were I you, etc. If a pronoun such as I, my or the Russian reflexive possessive svoj is used in such a context, two options are theoretically possible: either it picks out the speaker’s real self (de re), or it refers to the identity assumed by the speaker in the contrary-to-fact situations introduced by the counterfactual (de se).
Using data from the GICR corpus (approx. 20 billion tokens), I show that for the Russian first-person singular pronoun ja and its corresponding possessive moj, de se reference is possible but the de re interpretation is more frequent. The opposite holds for the reflexive sebja, whereas svoj is interpreted de se without exception. Special attention is paid to situations where more than one referential strategy is possible. The paper concludes with a couple of observations relevant for future formal accounts of de se reference.