Coreference Chains in Czech, English and Russian: Preliminary Findings.
This paper is a pilot comparative study of coreference chaining in three languages: Czech, English and Russian. We have analyzed 16 parallel English-Czech newspaper texts and 16 texts in Russian (similar to the English-Czech ones in length and topic). Our motivation was to find out what the linguistic structure of coreference chains is in different languages and what types of distinctions should be taken into account to advance the development of coreference resolution systems. Taking into account theoretical approaches to the phenomenon of coreference, we based our research on the following assumption: the recognition of coreference links for different structural types of noun phrases is regulated by different language mechanisms. The other starting point was that different languages allow pronominal chains of different lengths and that the properties of coreference chains differ across languages with different strategies for zero anaphora and different systems of definiteness marking. This work reports our first findings on comparing the distribution of structural NP types in the three languages under analysis.
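The comparison of structural NP type distributions mentioned above can be pictured with a minimal sketch. The chain representation (each chain as a list of NP type labels), the label names and the sample data are assumptions made for illustration only; they are not the annotation scheme or the data of the study.

```python
from collections import Counter
from typing import Dict, List


def np_type_distribution(chains: List[List[str]]) -> Dict[str, float]:
    """Return the relative frequency of each NP type over all chain members."""
    counts = Counter(np_type for chain in chains for np_type in chain)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {np_type: count / total for np_type, count in counts.items()}


# Hypothetical toy chains; the labels are illustrative, not the paper's tag set.
sample_chains = [["proper", "zero", "pronoun", "zero"]]
dist = np_type_distribution(sample_chains)
```

Computing one such distribution per language would give comparable profiles, under the assumption that the same label inventory is applied to all three languages.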
This dissertation analyzes the reflexivity patterns in Uralic languages from the point of view of a minimalist approach to binding. The languages under consideration are five Uralic languages spoken in the Russian Federation: Meadow Mari, Komi-Zyrian, Khanty, Besermyan Udmurt, and Erzya. The empirical data were compiled during fieldwork, and are used to test and assess current approaches to binding. The main focus of the dissertation is on a number of puzzles posed by these languages, namely the locally bound pronominals in Khanty, as well as the binding domains of what I call semi-reflexives and their ability to take split antecedents in Meadow Mari, Komi-Zyrian, Besermyan Udmurt, and Erzya. The analysis of reflexive strategies proposed in this dissertation is based on a modular approach to binding (see Reuland 2011). It disentangles the various factors playing a role in establishing interpretive dependencies, including properties of predicates and syntactic chains. The puzzling behavior of reflexive strategies under discussion is accounted for in terms of their morphosyntactic composition in tandem with general properties of grammatical computation. The present approach provides a unified basis for verbal and nominal reflexives. Overall, the study shows that cross-linguistic variation is not random. It demonstrates how descriptive fieldwork and theoretical research can be mutually beneficial and how their symbiosis deepens our understanding of the general principles underlying language, and the way these are rooted in our cognitive system.
RU-EVAL is a biennial event organized in order to estimate the state of the art in Russian NLP resources, methods and toolkits and to compare various methods and principles implemented for Russian. Russian could be treated as an under-resourced language due to the lack of freely distributable gold standard corpora for different NLP tasks (each team has tried to work out its own standards). Thus, our goal was to work out a uniform basis for comparing systems based on different theoretical and engineering approaches, to build evaluation resources, and to provide a flexible system of evaluation that differentiates between unacceptable and linguistically “admissible” errors. The paper reports on three events devoted to morphological tagging, dependency parsing and anaphora resolution, respectively.
This paper concerns discourse-new mention detection in Russian. Detecting the mention of an entity newly introduced into discourse might be helpful for different NLP applications such as coreference resolution, protagonist identification, summarization and various information extraction tasks. In our work, we deal with Russian, which has no grammatical devices, such as articles in English, for overtly marking a newly introduced referent. Our aim is to check the impact of various features on this task. The focus is on the specific devices for introducing a new discourse-prominent referent in Russian that have been described in theoretical studies. We conduct a pilot study of feature impact and provide a series of experiments on detecting the first mention of a referent in a non-singleton coreference chain, drawing on linguistic insights about how the introduction of a prominent entity into discourse is reflected in structural, morphological and lexical features.
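The detection target described above, the first mention of a referent in a non-singleton coreference chain, can be made concrete with a short sketch. The mention representation (token offset spans), the function name and the sample chains are hypothetical and serve only to illustrate how such labels could be derived from chain annotation; the paper's actual data format is not specified here.

```python
from typing import Dict, List, Tuple

# Hypothetical mention representation: (start_token, end_token) offsets.
Mention = Tuple[int, int]


def label_discourse_new(chains: List[List[Mention]]) -> Dict[Mention, bool]:
    """Mark a mention True if it opens a chain with more than one mention."""
    labels: Dict[Mention, bool] = {}
    for chain in chains:
        ordered = sorted(chain)  # textual order by start offset
        for position, mention in enumerate(ordered):
            labels[mention] = position == 0 and len(ordered) > 1
    return labels


# Toy input: one three-mention chain and one singleton chain.
chains = [[(10, 11), (0, 2), (25, 25)], [(5, 6)]]
labels = label_discourse_new(chains)
```

Singleton mentions are labeled False here because the task as stated concerns only referents that are mentioned again later in the text.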
The paper reports on the recent forum RU-EVAL, a new initiative for the evaluation of Russian NLP resources, methods and toolkits. The first two events were devoted to morphological and syntactic parsing, respectively. The third event is devoted to anaphora and coreference resolution. Seven participating IT companies and academic institutions submitted their results for the anaphora resolution task, and three of them presented results for the coreference resolution task as well. The event was organized in order to estimate the state of the art for this NLP task in Russian and to compare various methods and principles implemented for Russian. We discuss the evaluation procedure, specify the anaphora and coreference tasks, and describe the phenomena taken into consideration. We also give a brief overview of similar evaluation events on whose experience we draw. Finally, we formulate guidelines for constructing the training and Gold Standard corpora and present the measures used in evaluation.
Many NLP researchers, especially those not working in the area of discourse processing, tend to equate coreference resolution with the sort of coreference that people did in MUC, ACE, and OntoNotes, and have the impression that coreference is a well-worn task, owing in part to the large number of papers reporting results on the MUC/ACE/OntoNotes corpora. Given the plethora of work on entity coreference, and aware of other fora gathering coreference-related papers (such as LAW, DiscoMT or EVENTS), we believed that the time was ripe for a new workshop on the single topic of coreference resolution. The workshop would bring together researchers interested in under-investigated coreference phenomena and willing to contribute both theoretical and applied computational work on coreference resolution, especially for languages other than English, less-researched forms of coreference and new applications of coreference resolution.
The paper concerns discourse-new referent detection. The task of coreference resolution is essential in many text-mining applications. The focus in this task is to detect noun phrases (NPs) that refer to the same entity. In languages without articles, an NP carries no overt grammatical clue as to whether it introduces a new referent into discourse or refers to a previously mentioned entity. However, some theoretical studies claim that NPs mentioning a referent for the first time have specific features. In our research, we examine features that serve as discourse-new detectors for NPs corresponding to discourse-salient referents and present an experiment on the contribution of different features to this detection. First-mention detection could improve the quality of coreference resolution systems.
The paper focuses on the paths of grammaticalization of the verb of speech manaš (‘say’, ‘name’) in Eastern Mari. The converb of this verb (manən) is desemanticized: it loses the syntactic properties of a verb of speech and shifts to the category of subordinators. Successive grammaticalization steps of this marker can be observed in Modern Mari: in some contexts it functions as a quotation marker, while in others it functions as a subordinator. On the basis of our analysis, we suggest two paths of grammaticalization of this form: the first path involves the context of verbs of speech and of mental and emotive complement-taking predicates; the second path involves the contexts of causation and potential situation (in complementation) and purpose and causal adverbial clauses. The argumentation for this grammaticalization pattern is based on the constraints on subordinate predicate encoding (acceptability of non-finite clauses with manən), the choice of pronouns [we focus on the choice of the anaphoric vs. deictic strategy of encoding the textual (“original” in [Aikhenvald 2008]) speaker and hearer] and the mood of the verb in the complement clause. We show that in Modern Mari the analyzed form can have the following functions: a quotation marker, a subordinator in complement and adverbial clauses, a discourse marker of hesitation and autocorrection, and a semantically empty subordinator that is used to express negation with the infinitive.