Coreference in Russian Oral Movie Retellings (the Experience of Coreference Relations Annotation in “Russian CliPS” corpus)
PRE-CogSci 2013 is a follow-up to two successful earlier workshops on the production of referring expressions. The first, PRE-CogSci 2009, focussed on the interplay between computational and empirical methods, organised as part of the 31st CogSci conference in Amsterdam. The second, PRE-CogSci 2011 in Boston, broadened this theme to include work on dialogue and linguistic theory. We explore new directions for computational and cognitive work (e.g., collaborative reference, nondeterminism in production, interaction between comprehension and production, combinations with research on vision).
Abstract. There is currently a great need for modern, standardized neuropsychological tests for language assessment in Russian speakers with aphasia. Our group is working on the development of the Russian Aphasia Test (RAT). Within the scope of this work, two subtests for single-word comprehension of nouns and verbs were developed considering contemporary models of language processing and principles of psychometrics. The task for both subtests was spoken word-to-picture matching. The subtests were normed on individuals with aphasia (n = 45) and a control group (n = 30). This resulted in the final set of 30 diagnostic trials for nouns and verbs matched on relevant psychometric properties which are sensitive to language impairments for both fluent and non-fluent types of aphasia. This set of trials will be included in the final version of the RAT.
Referential choice is the process of selecting an appropriate referential expression for a referent that the speaker/writer intends to mention at some point in discourse. Referential choice is governed by the referent's current status in the speaker's/writer's working memory. This status, in turn, is determined by a number of factors rooted in the discourse context and the referent's properties. Activation in working memory is immediately responsible for the coarse choice between full and reduced referential devices, which is the high-level distinction in the hierarchical organization of referential choice. Lower levels of granularity correspond to the choice between proper names and descriptions, and to still more refined options. Referential choice is a multi-factorial process. We have created a corpus of written texts in which many potentially relevant factors of referential choice are annotated. We also use another corpus in which the same texts are annotated for discourse structure, as it is known that rhetorical distance, measured on the basis of hierarchical discourse structure, is a powerful factor of referential choice. We have modeled referential choice in the corpus with the help of a variety of machine learning algorithms. The accuracy of prediction for the choice between full and reduced referential devices is close to 90%, and for the three-way choice between pronouns, descriptions, and proper names it is close to 80%. We experimented with the reduction of the set of factors and explored the phenomenon of non-categorical, that is, probabilistic, referential choice.
This dissertation analyzes the reflexivity patterns in Uralic languages from the point of view of a minimalist approach to binding. The languages under consideration are five Uralic languages spoken in the Russian Federation: Meadow Mari, Komi-Zyrian, Khanty, Besermyan Udmurt, and Erzya. The empirical data were compiled during fieldwork, and are used to test and assess current approaches to binding. The main focus of the dissertation is on a number of puzzles posed by these languages, namely the locally bound pronominals in Khanty, as well as the binding domains of what I call semi-reflexives and their ability to take split antecedents in Meadow Mari, Komi-Zyrian, Besermyan Udmurt, and Erzya. The analysis of reflexive strategies proposed in this dissertation is based on a modular approach to binding (see Reuland 2011). It disentangles the various factors playing a role in establishing interpretive dependencies, including properties of predicates and syntactic chains. The puzzling behavior of reflexive strategies under discussion is accounted for in terms of their morphosyntactic composition in tandem with general properties of grammatical computation. The present approach provides a unified basis for verbal and nominal reflexives. Overall, the study shows that cross-linguistic variation is not random. It demonstrates how descriptive fieldwork and theoretical research can be mutually beneficial and how their symbiosis deepens our understanding of the general principles underlying language, and the way these are rooted in our cognitive system.
Corpus linguistics can be broadly defined in terms of two partially overlapping research dimensions. On the one hand, corpus linguistics is knowledge of how to compile and annotate linguistic corpora. On the other hand, corpus linguistics is a family of qualitative and quantitative methods of language study based on corpus data. The book presents the first steps taken by Russian corpus linguistics toward the development of language corpora and corpus-based resources as well as their use in grammatical and lexical analysis.
The first part of the book focuses on the annotation of Russian texts at several levels: lemmas, part of speech and inflectional forms, word formation, lexical-semantic classes, syntactic dependencies, semantic roles, frames, and lexical constructions. We discuss various theoretical principles and practical considerations motivating the corpus markup design, provide details on the creation of lexical resources (electronic dictionaries and databases) and text processing software, and consider complicated cases that present challenges for the annotation of corpora both manually and automatically. In most cases we describe the annotation of the Russian National Corpus (RNC, ruscorpora.ru) and its affiliate project FrameBank (framebank.ru).
Frequency data depend not only on the representativeness and balance of texts in a corpus, but also on the rules and tools used for annotation. The book addresses the development of evaluation standards for Russian NLP resources, namely, morphological taggers and dependency parsers. In addition, the book presents several experiments on automatic annotation and disambiguation: lemmatization of word forms not in the dictionary; word sense disambiguation based on vectors formed by lexical, semantic and grammatical cues of context; and semantic role labeling.
The final chapters of the first part of the book outline two types of frequency dictionaries based on the RNC data: a general-purpose frequency dictionary and a lexico-grammatical one.
The second part of the book presents an analysis of corpus data and includes a number of case studies of Russian grammar and lexical-grammatical interaction using quantitative methods. The key concept underlying our analysis is the behavioral profile (Hanks 1996; Divjak, Gries 2006), which is the frequency distribution of variable elements in a linguistic unit as attested in a corpus. This covers grammatical profiles (the frequency distribution of inflected forms of a word), constructional profiles (the frequency distribution of argument or any other constructions attested for a key predicate), lexical and semantic profiles (the frequency distribution of words and lexical-semantic classes in construction slots or, more generally, in the context of a word), and radial category profiles (the frequency distribution of word senses and word uses across the radial category network of a polysemous unit). We use grammatical, constructional, semantic, and radial category profiling to study tense, aspect and mood specialization of Russian verb forms; to identify singular-oriented and plural-oriented nouns; to investigate factors for prefix choice and prefix variation in natural perfectives (chistovidovye perfectivy); to analyze constraints on the filling of slots in a construction and how this affects the meaning of the construction, taking as an example the Genitive construction of shape and the spatial construction with the preposition poverkh ‘up and over’.
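As an illustration of the grammatical profile defined above, the following minimal sketch computes the relative frequency of each inflected-form tag for one lemma. It is not the authors' implementation: the function name `grammatical_profile`, the (lemma, tag) input format, and the toy data are all assumptions made for this example, and the counts are invented rather than taken from the RNC.

```python
from collections import Counter


def grammatical_profile(tagged_tokens, lemma):
    """Return the relative frequency of each form tag attested for `lemma`.

    `tagged_tokens` is an iterable of (lemma, tag) pairs; this input format
    is a hypothetical simplification of real corpus annotation.
    """
    counts = Counter(tag for lem, tag in tagged_tokens if lem == lemma)
    total = sum(counts.values())
    if total == 0:
        return {}  # the lemma is not attested in the sample
    return {tag: n / total for tag, n in counts.items()}


# Invented toy sample: transliterated lemmas with number tags only.
toy_corpus = [
    ("glaz", "pl"), ("glaz", "pl"), ("glaz", "sg"), ("glaz", "pl"),
    ("nos", "sg"), ("nos", "sg"), ("nos", "pl"), ("nos", "sg"),
]

profile = grammatical_profile(toy_corpus, "glaz")
print(profile)
```

In this toy sample the plural share for "glaz" is higher than the singular share, which is the kind of skew that would mark a noun as plural-oriented; real profiles would be computed over full corpus counts and a richer tag set.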
The quantitative corpus-based techniques used for the analysis vary from simple descriptive statistics (e.g., absolute frequencies, percentages, measures of central tendency and outliers) to Fisher's exact test and logistic regression. We claim that vector modeling approaches to quantitative grammatical studies are no less effective in theoretical linguistics than in computational linguistics, where they have become a standard tool.
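For readers unfamiliar with Fisher's exact test, here is a minimal standard-library sketch of the two-sided test on a 2x2 contingency table, such as singular versus plural counts for two nouns. The function name `fisher_exact_two_sided` and the example counts are assumptions for illustration only; the book does not specify which software was used, and a statistics package would normally be preferred in practice.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum over all tables that are no more probable than the observed one;
    # the small tolerance guards against floating-point ties.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))


# Invented counts: rows are two nouns, columns are singular and plural uses.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
print(p_value)
```

The test conditions on the row and column totals, which makes it suitable for the small cell counts that often arise when a corpus is split by word form.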
The choice of an appropriate referential expression (definite description, proper name or pronoun) depends on multiple factors. This paper focuses on how the possessor position of a referential expression and of its antecedent affects referential choice. Other factors, such as the syntactic role, form and definiteness of the antecedent and the animacy of the referent, are also considered. The study is based on a subcorpus of the specially designed RefRhet corpus.
The paper focuses on the paths of grammaticalization of the verb of speech manaš (‘say’, ‘name’) in Eastern Mari. The converb of this verb (manən) is desemanticized: it loses the syntactic properties of a verb of speech and shifts to the category of subordinators. Successive grammaticalization steps of this marker can be observed in Modern Mari: in some contexts it functions as a quotation marker, while in others it functions as a subordinator. On the basis of this analysis, we suggest two paths of grammaticalization of this form: the first path involves the context of verbs of speech, mental and emotive complement-taking predicates; the second path involves the contexts of causation and potential situation (in complementation), purpose and causal adverbial clauses. The argumentation for this grammaticalization pattern is based on the constraints on subordinate predicate encoding (acceptability of non-finite clauses with manən), the choice of pronouns [we focus on the choice of the anaphoric vs. deictic strategy of encoding the textual («original» in [Aikhenvald 2008]) speaker and hearer] and the mood of the verb in the complement clause. We show that in Modern Mari the analyzed form can have the following functions: a quotation marker, a subordinator in complement and adverbial clauses, a discourse marker of hesitation and autocorrection, and a semantically empty subordinator that is used to express negation with the infinitive.