Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference "Dialogue" (Bekasovo, May 29 – June 2, 2013). In two volumes.
Manually annotated corpora are very important and very expensive resources: the annotation process requires a lot of time and skill. In the OpenCorpora project we try to involve native speakers with no special linguistic training in the annotation work. In this paper we describe how we organize our processes in order to maintain high annotation quality, and report on our preliminary results.
The paper continues research into words denoting everyday life objects in the Russian language. This research is conducted for developing a new encyclopedic thesaurus of Russian everyday life terminology. Working on this project brings up linguistic material which leads to discovering new trends and phenomena not covered by the existing dictionaries. We discuss derivation models which are gaining popularity: clipped forms (komp < komp'juter 'computer', nout < noutbuk 'notebook computer', vel < velosiped 'bicycle', mot < motocikl 'motorbike'), competing masculine and feminine contracted nouns derived from adjectival noun phrases (mobil'nik (m.) / mobilka (f.) < mobil'nyj telefon (m.) 'mobile phone', zarjadnik (m.) / zarjadka (f.) < zarjadnoe ustrojstvo (n.) 'AC charger'), and hybrid compounds (plat'e-sviter 'sweater dress', jubka-brjuki 'skirt pants', shapkosharf 'scarf hat', vilkolozhka 'spork, foon'). These words vary in spelling and syntactic behaviour. We describe a newly formed series of words denoting multifunctional objects: mfushka < MFU < mnogofunkcional'noe ustrojstvo 'MFD, multifunction device', mul'titul 'multitool', centr 'unit, set'. Explaining the need to compose frequency lists of word meanings rather than just words, we offer a technique for gathering such lists and provide a sample produced from our own data. We also analyze existing dictionaries and perform various experiments to study the changes in word meanings and their comparative importance for speakers. We believe that, apart from the practical usage in our lexicographic project, our results might prove interesting for research into the evolution of the Russian lexical system.
The article presents the Typological Database of Qualities, which aims at providing a new tool for research in lexical typology. The database contains information on the lexicalization of several semantic fields of adjectives in different languages (like 'sharp' — 'blunt', 'empty' — 'full', 'solid' — 'soft', 'thick' — 'thin', 'smooth' — 'rough', etc.). We discuss issues concerning database structure (in particular, the choice of information units that would make the meanings from different languages comparable to each other). Special attention is devoted to the representation of figurative meanings in the Database, which makes it possible to investigate the models of their derivation from literal meanings. The developed database can be used for solving both theoretical and practical tasks. On the practical level, the Database may serve as a multilingual dictionary which accounts for fine-grained differences in meaning between individual words. On the theoretical side, the Database allows for various generalizations on cross-linguistic patterns of polysemy and semantic change.
The paper considers the semantic structure of emotion causatives and their interaction with negation, namely, its narrow or wide scope. Emotion causatives are defined as a group of causatives with specific semantic properties that distinguish them from other groups of causatives. One of these properties concerns their relation to the corresponding decausatives, which, unlike causatives, do not license wide scope of negation. Several factors enable negation to take scope over the causative component in emotion causatives: imperfective aspect, generic referential status of the causative NP, and the agentivity and conativity of the causative. Non-agentive causatives never license negation of the causative component. Agentive conative causatives license negation of the causative component more frequently and easily than agentive non-conative causatives, prompting the assumption that the causative component has a different status in their semantic structures (assertion in the former, presupposition in the latter). It also takes different forms for conatives and non-conatives. Conativity vs. non-conativity of emotion causatives is related to the emotion type, with conative synthetic causatives being limited to basic emotions. The greatest degree of conativity and, hence, the assertive status of the causative component characterizes three emotion causatives: zlit' 'to make mad', veselit' 'to cheer up', and pugat' 'to frighten'.
Key words: causative, decausative, agentive, conative, intentional, presupposition, assertion, semantic structure, basic emotions
The article deals, in a typological perspective, with verbs describing sounds of inanimate objects (cf. the noise of a door being opened, of coins in somebody's pocket, of a river, etc.). The analysis is based on data from four languages (Russian, German, Komi-Zyrjan, Khanty), obtained from dictionaries, corpora and field investigation. We discuss, first, the primary meanings of these verbs and identify the parameters that underlie semantic distinctions between them (type of sound source and its features, type of situation causing the emission of a sound, acoustic properties of sounds). Then we consider the derived meanings of sound verbs, which develop through metonymic and metaphoric shifts, and analyze the mechanisms behind each of these shifts. Finally, we examine a type of semantic change in our data which cannot be explained in terms of either of those mechanisms and hence represents a separate kind of meaning shift.
The paper presents a semantic and pragmatic analysis of noun reduplication in colloquial Russian and the Internet language. We consider the repetition of a noun within the same prosodic unit, separated by the particle "takoj" ('such'), as in "statja takaja statja" ('paper such a paper'). Drawing on a corpus of examples gathered from Internet texts, we categorize the semantics of this reduplication pattern into six types: (1) prototype and connotation, (2) failure to fit a stereotype, (3) condescension and irony, (4) expression of emotions, (5) discourse topic and scene-setting topic, and (6) object nomination and ellipsis. Compared to the model "such X-X", the model "X such X" more often conveys a negative attitude. We also consider the syntactic structure of the given reduplication pattern.
A new electronic frequency dictionary shows the distribution of grammatical forms in the inflectional paradigm of Russian nouns, adjectives and verbs, i.e. the grammatical profile of individual lexemes and lexical groups. While the frequency hierarchy of grammatical categories (e.g. the frequency of part-of-speech classes or the average ratio of Nominative to Instrumental case forms) has long been a standard topic of research, the present project shifts the focus to the distribution of grammatical forms in particular lexical units. Of particular concern are words with certain biases in their grammatical profile, e.g. verbs used mostly in the Imperative or in the past neuter, or nouns used mostly in the plural. The dictionary will be a source for future research on Russian grammar, paradigm structure, grammatical semantics, as well as variation of grammatical forms.
The resource is based on the data of the Russian National Corpus. The article addresses some general issues such as the use of corpora in compiling frequency resources and the technology of corpus data processing. We suggest certain solutions related to the selection of data and the level of granularity of the grammatical profile. Text creation time and language register are discussed as parameters which may shape grammatical profile fluctuations.
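The notion of a grammatical profile described above can be illustrated with a minimal sketch: given a morphologically annotated token stream, count the relative frequency of each grammatical form per lemma. The lemmas and tag names below are invented toy data, not the RNC tagset.

```python
from collections import Counter, defaultdict

def grammatical_profiles(tagged_tokens):
    """Compute the per-lemma distribution of grammatical forms.

    tagged_tokens: iterable of (lemma, gram_tag) pairs, as might come
    from a morphologically annotated corpus (tags invented here).
    Returns {lemma: {tag: relative_frequency}}.
    """
    counts = defaultdict(Counter)
    for lemma, tag in tagged_tokens:
        counts[lemma][tag] += 1
    profiles = {}
    for lemma, ctr in counts.items():
        total = sum(ctr.values())
        profiles[lemma] = {tag: n / total for tag, n in ctr.items()}
    return profiles

# Toy corpus: a verb strongly biased toward the imperative,
# the kind of bias the dictionary is designed to surface.
corpus = [("smotret'", "imper"), ("smotret'", "imper"),
          ("smotret'", "imper"), ("smotret'", "pres"),
          ("kniga", "sg.nom"), ("kniga", "pl.nom")]
print(grammatical_profiles(corpus)["smotret'"]["imper"])  # 0.75
```

A real profile would, as the abstract notes, also be conditioned on text creation time and register, which here would simply mean keying the counters on (lemma, period, register) tuples.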
Our research aims at automatic identification of constructions associated with particular lexical items and its subsequent use in building a catalogue of Russian lexical constructions. The study is based on data extracted from the Russian National Corpus (RNC, http://ruscorpora.ru). The main emphasis is on extensive use of morphological and lexico-semantic data drawn from the multi-level corpus annotation. Lexical constructions are regarded as the most frequent combinations of a target word and corpus tags which regularly occur within a certain left and/or right context and mark a given meaning of the target word. We focus on nominal constructions with target lexemes that refer to speech acts, emotions, and instruments. The toolkit that processes corpus samples and learns the constructions is described. We provide an analysis of the structure and content of extracted constructions (e.g. r:ord der:num t:ord r:qual|pervyj 'first' + LJUBOV' 'love'; LJUBOV' 'love' + PR|s 'from' + ANUM m sg gen|pervyj 'first' + S f inan sg gen|vzgljad 'sight' = love at first sight). As regards their structure, constructions may be considered as n-grams (with n from 2 to 5). The representation of constructions is bipartite, as they may combine either morphological and lemma tags or lexical-semantic and lemma tags. We discuss the use of the visualization module PATTERN.GRAPH that represents the inner structure of extracted constructions.
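The idea of constructions as frequent target-anchored tag n-grams can be sketched as follows. This is a simplified stand-in for the toolkit described above: the sentences, tags and lemmas are invented, and only left-context bigrams are collected.

```python
from collections import Counter

def extract_constructions(sentences, target, n=2, min_freq=2):
    """Collect frequent n-grams of tags adjacent to a target lemma.

    sentences: lists of (lemma, tag) pairs; the single tag per token
    stands in for the multi-level RNC annotation (format invented).
    Returns target-anchored patterns occurring at least min_freq times.
    """
    patterns = Counter()
    for sent in sentences:
        for i, (lemma, _tag) in enumerate(sent):
            if lemma != target:
                continue
            left = sent[max(0, i - (n - 1)):i]
            if len(left) == n - 1:
                key = tuple(t for _, t in left) + (target.upper(),)
                patterns[key] += 1
    return [(p, c) for p, c in patterns.most_common() if c >= min_freq]

# Toy data modelled on the 'pervyj + LJUBOV'' example in the abstract.
sents = [[("pervyj", "ANUM"), ("ljubov'", "S")],
         [("pervyj", "ANUM"), ("ljubov'", "S")],
         [("bol'shoj", "A"), ("ljubov'", "S")]]
print(extract_constructions(sents, "ljubov'"))  # [(('ANUM', "LJUBOV'"), 2)]
```

The real system additionally mixes lemma tags with morphological or lexical-semantic tags in one pattern; that would amount to emitting several keys per position rather than one.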
The Information Extraction task, and in particular the task of Named Entity Recognition (NER) in unstructured texts, is essential for modern mass media systems. The paper presents a case study of an NER system for Russian. The system was built and tested on Russian news texts. The ambiguity resolution method under discussion is based on dictionaries and heuristic rules. The dictionary-oriented approach is motivated by a set of strict initial requirements: first, the target set of named entities should be extracted with very high precision; second, the system should be easily adapted to a new domain by non-specialists; and third, these updates should preserve the same high precision. We focus on the architecture of the dictionaries and on the properties the dictionaries should have for each class of named entities in order to resolve ambiguous situations. The five classes under consideration are Person, Location, Organization, Product and Named Event. The properties and structure of synonyms and of the context words, expressions and entities necessary for disambiguation are discussed.
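A dictionary-plus-heuristics pipeline of the kind described can be sketched in a few lines. This is only an illustration of the disambiguation idea, not the paper's actual rules: the gazetteer entries and context rules below are invented.

```python
def ner_lookup(tokens, gazetteers, context_rules):
    """Dictionary-driven NER with heuristic disambiguation (a sketch).

    gazetteers: {surface_form: set_of_classes};
    context_rules: set of (context_word, class) pairs that vote for a
    class when the context word is adjacent to an ambiguous name.
    """
    result = []
    for i, tok in enumerate(tokens):
        classes = gazetteers.get(tok)
        if not classes:
            continue
        if len(classes) == 1:
            result.append((tok, next(iter(classes))))
            continue
        # Ambiguous entry: consult the immediate neighbours.
        neighbours = tokens[max(0, i - 1):i] + tokens[i + 1:i + 2]
        chosen = None
        for ctx in neighbours:
            for cls in classes:
                if (ctx, cls) in context_rules:
                    chosen = cls
        if chosen:  # high-precision policy: skip unresolved ambiguity
            result.append((tok, chosen))
    return result

gaz = {"Washington": {"Person", "Location"}, "Moscow": {"Location"}}
rules = {("president", "Person"), ("in", "Location")}
print(ner_lookup(["president", "Washington"], gaz, rules))
# [('Washington', 'Person')]
```

Note the last branch: when no rule fires, the ambiguous name is dropped rather than guessed, which mirrors the precision-first requirement stated in the abstract.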
The paper proposes a substantive classification of collocates (pairs of words that tend to co-occur), along with heuristics that can help to attribute a word pair to the proper type automatically.
The best studied type is frequent phrases, a type that includes idioms, lexicographic collocations, and syntactic selection. Pairs of this type are known to occur at a short distance and can be singled out by choosing a narrow window for collecting cooccurrence data.
The next most salient type is topically related pairs. These can be identified by considering word frequencies in individual documents, as in the well-known distributional topic models.
The third type is pairs that occur in repeated text fragments such as popular quotes or standard legal formulae. The characteristic feature of these is that the fragment contains several aligned words that are repeated in the same sequence. Such pairs are normally filtered out for most practical purposes, but filtering is usually applied only to exact repeats; we propose a method of capturing inexact repetition.
Hypothetically one could also expect to find a fourth type: collocate pairs linked by an intrinsic semantic relation or a long-distance syntactic relation. Such a link would guarantee co-occurrence at a certain relatively restricted range of distances, a range narrower than in the case of a purely topical connection, but not so narrow as in repeats. However, we did not find many cases of this sort in our preliminary empirical study.
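The heuristic behind the first type — that phrases surface under a narrow collection window — can be sketched directly. The text and window sizes below are toy choices for illustration; a real study would also apply an association measure rather than raw counts.

```python
from collections import Counter

def cooccurrences(tokens, window):
    """Count unordered word pairs co-occurring within `window` tokens.

    A narrow window (1-3) surfaces the 'frequent phrase' type from the
    classification above; widening it admits topical pairs as well.
    """
    pairs = Counter()
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            pairs[tuple(sorted((w, tokens[j])))] += 1
    return pairs

text = "strong tea with lemon and strong tea with milk".split()
narrow = cooccurrences(text, 1)
print(narrow[("strong", "tea")])  # 2: adjacency signals a phrase
```

Applying the same function with a large window to per-document frequencies would instead approximate the second, topical type, which is why the window size serves as the classifying heuristic.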
Automatic verb-noun collocation extraction is an important natural language processing task. The results obtained in this area of research can be used in a variety of applications including language modeling, thesaurus building, semantic role labeling, and machine translation. Our paper describes an experiment aimed at comparing verb-noun collocation lists extracted from a large corpus using a raw word-order (window-based) approach and a syntax-based approach. The hypothesis was that the latter method would result in less noisy and more exhaustive collocation sets. The experiment has shown that the collocation sets obtained using the two methods have a surprisingly low degree of correspondence. Moreover, the collocate lists extracted by means of the window-based method are often more complete than the ones obtained by means of the syntax-based algorithm, despite the latter's ability to filter out adjacent collocates and reach distant ones. In order to interpret these differences, we provide a qualitative analysis of some common mismatch cases.
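The set-level comparison behind the reported "low degree of correspondence" can be sketched as follows. The collocate pairs are invented toy data; the real experiment of course compared lists extracted from a large corpus.

```python
def collocate_overlap(window_set, syntax_set):
    """Compare collocate sets from the two extraction methods.

    Returns (jaccard, only_window, only_syntax): the overlap score
    plus the pairs found by only one of the two methods, which are
    exactly the mismatch cases the qualitative analysis examines.
    """
    inter = window_set & syntax_set
    union = window_set | syntax_set
    jaccard = len(inter) / len(union) if union else 0.0
    return jaccard, window_set - syntax_set, syntax_set - window_set

w = {("drink", "tea"), ("drink", "coffee"), ("drink", "cup")}
s = {("drink", "tea"), ("drink", "water")}
j, only_w, only_s = collocate_overlap(w, s)
print(round(j, 2))  # 0.25
```

Here ("drink", "cup") is the sort of noisy window-only pair the syntax-based method was expected to filter, while ("drink", "water") might be a long-distance dependency the window missed.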
In Daghestan, the number of Russian speakers has been dramatically increasing over the last few decades. Russian has assumed the functional niche previously vacant in this extremely multilingual setting, becoming the first ever lingua franca of the region as a whole. Russian is acquired in a situation of strong interaction with local languages and shows contact properties on various linguistic levels: phonetics, morphology, syntax and lexicon. Its regional variant is also visibly developing as a self-identification device. The aim of this paper is to discuss some (socio)linguistic properties of this idiom, attribute them either to interference or to imperfect learning, and to argue for building a corpus of Daghestanian Russian.
We develop a graph representation and learning technique for parse structures of sentences and paragraphs of text. We introduce the parse thicket: a set of syntactic parse trees augmented by a number of arcs for intersentence word-word relations such as coreference and taxonomic links. These arcs are also derived from other sources, including Rhetorical Structure Theory and Speech Act theory. We introduce respective indexing rules that identify intersentence relations and join phrases connected by these relations in the search index. We propose an algorithm for computing parse thickets from parse trees, and develop a framework for automatic building and generalization of parse thickets. The proposed approach is evaluated on product search, where search queries include multiple sentences. We compare the search relevance improvement obtained by pairwise sentence generalization and by thicket-level generalization.
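The construction step — joining per-sentence parse trees into one graph via intersentence arcs — can be sketched minimally. The node and label conventions below are invented for illustration; the paper's thickets carry much richer structure (rhetorical and speech-act arcs, phrase-level generalization).

```python
def build_parse_thicket(parse_trees, inter_arcs):
    """Merge per-sentence dependency edges and add intersentence arcs.

    parse_trees: list of edge lists [(head_node, dep_node, label)],
    where a node is a (sentence_index, word) pair.
    inter_arcs: extra (label, node_a, node_b) edges across sentences,
    e.g. coreference links.
    Returns one flat edge list representing the thicket graph.
    """
    edges = []
    for tree in parse_trees:
        edges.extend(tree)
    for label, a, b in inter_arcs:
        edges.append((a, b, label))
    return edges

t1 = [((0, "bought"), (0, "John"), "subj"),
      ((0, "bought"), (0, "camera"), "obj")]
t2 = [((1, "liked"), (1, "he"), "subj"),
      ((1, "liked"), (1, "it"), "obj")]
thicket = build_parse_thicket(
    [t1, t2],
    [("coref", (0, "John"), (1, "he")),
     ("coref", (0, "camera"), (1, "it"))])
print(len(thicket))  # 6 edges: 4 syntactic + 2 coreference arcs
```

Once the trees are flattened into a single graph, thicket-level generalization amounts to graph matching over this edge set rather than tree matching per sentence, which is what lets multi-sentence queries be compared as wholes.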
We present a database developed for the lexico-typological study of expressions of pain. Its design implements a non-relational, NoSQL approach, where data is organized not into a table but into a flexible tree unlimited in size and depth. Linguistic annotation is placed directly into the text of example sentences and their translations, so that in effect the database is structured as an annotated corpus. This formalism gives much freedom both to the developers in annotating examples and to users in their queries, since it allows them to vary the level of detail according to how much information is available or needed. Linguistic annotation includes tags for syntactic roles, some syntactic constructions and their components (relative clauses, light verbs, formal subjects, parts of compound words), morphological information (tags for case, number, aspect, etc.), as well as semantic tags specific to the domain of pain (semantic roles and types of metaphoric shift).
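A record in such a tree-structured store might look roughly like the sketch below. The field names, inline-annotation syntax and tag values are all invented here; the point is only that annotation lives inside the example text and that queries can address whatever level of detail is present.

```python
# A sketch of one tree-structured record: nested to arbitrary depth,
# with annotation embedded in the example text itself.
record = {
    "language": "Russian",
    "examples": [
        {
            "text": "[golova]{role:locus} [bolit]{verb,ipfv}",
            "translation": "my head aches",
            "semantics": {"domain": "pain", "shift": None},
        },
        {
            "text": "[serdce]{role:locus} [shchemit]{verb,metaphor}",
            "translation": "my heart aches",
            "semantics": {"domain": "pain", "shift": "metaphor"},
        },
    ],
}

def examples_with_shift(rec, shift):
    """Query at the level of the semantic annotation; examples
    lacking the requested field simply do not match."""
    return [e["translation"] for e in rec["examples"]
            if e["semantics"].get("shift") == shift]

print(examples_with_shift(record, "metaphor"))  # ['my heart aches']
```

Because missing fields just fail to match rather than break the schema, annotators can leave out information that is unavailable for a given language, which is the flexibility the abstract attributes to the non-relational design.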
A new technology is proposed for wide-coverage search applications over natural language texts. Its application to an expert search task is considered in detail on the example of the TREC Enterprise track. The vocabulary is treated statistically but, as opposed to the standard TF-IDF metric, two special metrics are used. They incorporate information about lexicon usage by authors and about communication between them. Calculating the strength of the connection between an author and the lexicon reveals the terms characteristic of that author, so the author can be found with the help of such terms. Lexicon weighting allows us to extract from the whole collection a small portion of the vocabulary which we call significant. The significant lexicon enables effective search in a thematically specialized field. Thus, our search engine minimizes the lexicon necessary for answering a query by extracting its most important part. The ranking function takes term usage statistics among authors into account in order to raise the role of significant terms in comparison with noisier ones. We demonstrate the possibility of effective expertise retrieval owing to several rationally built heuristic rating indicators. First, we achieve an expert search efficiency comparable with the most effective modern information retrieval engines. Second, the chosen indicators allow us to distinguish between "good" and "bad" queries, which is essential for further optimization of our engine. We discuss the possibility of applying our engine to other search and analytic scenarios such as plagiarism detection, information gap retrieval and others.
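The author-oriented weighting idea — scoring a term highly for an author when few other authors use it — can be sketched as an author-level analogue of TF-IDF. This is a simplified stand-in for the paper's two metrics (it ignores the communication graph between authors), and all data below is invented.

```python
import math
from collections import Counter

def author_term_weights(docs_by_author):
    """Weight terms by how strongly they are tied to one author.

    docs_by_author: {author: [document strings]}.
    A term used often by an author but by few authors overall scores
    high for that author; terms everyone uses score near zero, which
    is how the 'significant' lexicon separates from the noisy one.
    """
    term_authors = Counter()   # in how many authors' lexicons a term appears
    tf = {}
    for author, docs in docs_by_author.items():
        ctr = Counter(w for d in docs for w in d.split())
        tf[author] = ctr
        for term in ctr:
            term_authors[term] += 1
    n_authors = len(docs_by_author)
    return {
        author: {t: f * math.log(n_authors / term_authors[t])
                 for t, f in ctr.items()}
        for author, ctr in tf.items()
    }

docs = {"alice": ["ontology ontology reasoning"],
        "bob": ["reasoning sports"]}
w = author_term_weights(docs)
print(w["alice"]["ontology"] > w["alice"]["reasoning"])  # True
```

An expert search query would then rank authors by summing their weights over the query terms, so a query mentioning "ontology" retrieves alice even though both authors use "reasoning".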