Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta, 17-23 May 2010
The Bank of Russian Constructions and Valencies (Russian FrameBank) is an annotation project that takes as input samples from the Russian National Corpus (http://www.ruscorpora.ru). Since Russian verbs and predicates from other POS classes have particular and not always predictable case patterns, these words and their argument structures are described as lexical constructions. The slots of partially filled phrasal constructions (e.g. vzjal i uexal ‘he suddenly (lit. took and) went away’) are also analyzed. Thus, the notion of construction is understood in the sense of Fillmore’s Construction Grammar and is not limited to the argument structure of verbs. FrameBank brings together a dictionary of constructions and an annotated collection of examples. Our goal is to mark the set of arguments and adjuncts of a given construction. The main focus is on the realization of these elements in running text, to facilitate searches through pattern realizations by a given combination of features. The relevant dataset involves lexical, POS and other morphosyntactic tags, semantic classes, as well as grammatical constructions that introduce or license the use of elements within a given construction.
The paper describes the structure and possible applications of the theory of K-representations (knowledge representations) in bioinformatics and in the development of a new-generation Semantic Web. It is an original theory of designing semantic-syntactic analyzers of natural language (NL) texts with broad use of formal means for representing input, intermediary, and output data. The current version of the theory is set forth in a monograph by V. Fomichov (Springer, 2010). The first part of the theory is a formal model describing a system consisting of ten operations on conceptual structures. This model defines a new class of formal languages – the class of SK-languages. The broad possibilities of constructing semantic representations of complex discourses pertaining to biology are shown. A new formal approach to developing multilingual algorithms of semantic-syntactic analysis of NL texts is outlined. This approach is realized by means of a program in the Python language.
This paper is devoted to the use of two tools for creating morphologically annotated linguistic corpora: UniParser and the EANC platform. The EANC platform is the database and search framework originally developed for the Eastern Armenian National Corpus (www.eanc.net) and later adapted for other languages. UniParser is an automated morphological analysis tool developed specifically for creating corpora of languages with relatively small numbers of native speakers, for which the development of parsers from scratch is not feasible. It has been designed for use with the EANC platform and generates XML output in the EANC format.
UniParser and the EANC platform have already been used to create corpora of several languages: Albanian, Kalmyk, Lezgian, and Ossetic, of which the Ossetic corpus is the largest (5 million tokens, 10 million planned for 2013). They are currently being employed in the construction of corpora of Buryat and Modern Greek. This paper describes the general architecture of the EANC platform and UniParser, using the Ossetic corpus as an example of the advantages and disadvantages of the described approach.
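A wordlist-based analyzer of this kind can be sketched roughly as follows. This is a minimal illustration only: the lexicon, suffix rules, tag names, and XML layout below are assumptions for the sake of the example, not the actual UniParser rules or EANC format.

```python
# Sketch of wordlist-based morphological analysis emitting XML annotations.
# LEXICON, SUFFIXES, and the <w>/<ana> layout are hypothetical placeholders.
import xml.etree.ElementTree as ET

LEXICON = {"cat": "N", "walk": "V"}          # hypothetical stem lexicon
SUFFIXES = {"s": "pl", "ed": "pst", "": ""}  # hypothetical affix rules

def analyze(token):
    """Return (lemma, tags) pairs for every lexicon + suffix match."""
    analyses = []
    for suf, tag in SUFFIXES.items():
        stem = token[:len(token) - len(suf)] if suf else token
        if token.endswith(suf) and stem in LEXICON:
            pos = LEXICON[stem]
            analyses.append((stem, ",".join(t for t in (pos, tag) if t)))
    return analyses

def to_xml(token):
    """Wrap all analyses of a token in a single word element."""
    w = ET.Element("w")
    for lemma, gr in analyze(token):
        ET.SubElement(w, "ana", lex=lemma, gr=gr)
    w.text = token
    return ET.tostring(w, encoding="unicode")

print(to_xml("cats"))  # emits the token with its analyses as XML
```

Because ambiguous tokens simply receive several analysis elements, such output can be stored unresolved and disambiguated later, which matters for small-language corpora where no statistical disambiguator is available.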
The book contains the proceedings of the 18th International Conference on Automatic Processing of Natural Language (Montpellier, France, 27 June - 1 July 2011).
Four electronic corpora created in 2011 within the framework of the “Corpus Linguistics: the Albanian, Kalmyk, Lezgian, and Ossetic Languages” Program of Fundamental Research of the RAS are presented. The interface and functionality of these corpora are described, the engineering problems solved in their creation are elucidated, and the prospects of their development are discussed. Particular emphasis is placed on the compilation of dictionaries and the automatic grammatical markup of the corpora.
The aim of the article is to inform the professional readership of the potential of corpus analysis for L2 teaching, based on our own experience of implementing corpus-based activities in the L2 classroom. The paper is divided into four sections: Introduction (1), Corpus Tools (2), Examples of Classroom Use (3) and Conclusion (4). The Introduction outlines the recent corpus-driven changes in attitudes to language statistics, which are reflected in corpus-informed textbooks. Section Two, which has nine subsections, deals with corpus tools and notions of corpus analysis (concordance, collocation and colligation search, corpus statistics, semantic prosody, etc.) in the L2 teaching context. In particular, we discuss condensed reading, vertical scanning of concordances for lexico-grammatical profiling, and other teaching tools for developing L2 linguistic competence. These are later supported (Section 3) by corpus-oriented classroom activities with possible teaching outcomes outlined. Some experience-based comments are also given regarding the language level of students who could benefit from corpus data analysis. Based on our research results, the Conclusion elaborates on the idea of corpus competence as well as the necessity for corpus tools to be used by both language teaching professionals and students.
Theoretically, as was initially suggested by the pioneers of data-driven learning, not only the researcher but also the learner should be given the chance to study language through corpora. The article argues that corpus tools for collocation search, together with colligation detection (i.e. probable grammar structures), are powerful means of developing both language and research skills.
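The concordance-based activities discussed above rest on a very simple mechanism, keyword-in-context (KWIC) display, which can be sketched as follows; the toy sentence and window size are illustrative assumptions, not material from the article.

```python
# Minimal KWIC (keyword-in-context) concordance sketch: for every hit of the
# keyword, show a fixed window of left and right context for vertical scanning.
def concordance(tokens, keyword, window=2):
    """Return 'left | keyword | right' lines for each occurrence."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} | {tok} | {right}")
    return lines

text = "the corpus shows how the corpus data inform teaching".split()
for line in concordance(text, "corpus"):
    print(line)
```

Aligning the keyword in a fixed column is what makes vertical scanning for lexico-grammatical patterns possible in the classroom setting described.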
In addition to corpus-based activities and the theoretical grounding behind them, we also share our own experience of compiling a corpus of professional discourse. Both the idea and the practicality of a small university-made corpus are evaluated. A brief comparison of a diversified corpus (such as the British National Corpus) with a “home-made” corpus is provided.
The research also draws attention to the term “chunk of language”, which has been adopted by Western teaching methodology and is considered in the paper in frequency-probability (corpus) terms. It is suggested that a chunk of language bigger than a collocation lends itself to being discovered through a combination of various corpus tools. Such frequent language chunks (e.g. there is certain stigma attached to...) account for a large part of a native speaker’s vocabulary and fluency. They are believed to be stored in memory in great numbers and retrieved virtually undivided, although chunks may undergo minor colligational adjustments in speech. We believe that the discovery of frequent language chunks by language learners can be organized as an educational research activity under the guidance of a language instructor; the article provides some examples of such research activities in Section 3.
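The frequency-probability view of chunks described above amounts, in its simplest form, to counting n-grams and keeping those that recur; the sketch below illustrates this, with a toy corpus and threshold that are assumptions for the example, not data from the article.

```python
# Sketch of chunk discovery by n-gram counting: collect all n-grams in a
# token stream and keep those whose frequency clears a threshold.
from collections import Counter

def frequent_chunks(tokens, n=3, min_count=2):
    """Return (chunk, count) pairs for n-grams occurring at least min_count times."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return [(" ".join(g), c) for g, c in grams.most_common() if c >= min_count]

corpus = ("there is certain stigma attached to failure "
          "and there is certain stigma attached to debt").split()
print(frequent_chunks(corpus, n=4))
```

In a classroom research activity, learners could run such counts over a small corpus and compare the recurring strings they find with the chunks listed in their textbooks.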
Thus, the article will equip the reader with a clear understanding of the potential of corpus linguistics in the foreign language classroom, as well as with the capacity and confidence to engage in corpus analysis. It may be particularly beneficial for non-native speakers of English who teach English in an ESP context, since we believe that corpus research ends the monopoly of language intuition, which in today’s world is gradually being replaced by corpus statistics.
This paper is an overview of current issues and tendencies in computational linguistics, based on the materials of the COLING 2012 conference on computational linguistics. Modern approaches to the traditional NLP domains, such as POS tagging, syntactic parsing, and machine translation, are discussed. Highlights of automated information extraction, such as fact extraction and opinion mining, are also in focus. The main tendency of modern computational linguistics technologies is to incorporate higher levels of linguistic analysis (discourse analysis, cognitive modeling) into models and to combine machine learning techniques with algorithmic methods grounded in deep expert linguistic knowledge.
The project we present, the Russian Learner Translator Corpus (RusLTC), is a multiple learner translator corpus that stores Russian students’ translations both out of and into English. The project is being developed by a cross-functional team of translator trainers and computational linguists in Russia. Translations are collected from several Russian universities; all translations are made as part of routine and exam assignments or as submissions for translation contests by students majoring in translation. As of March 2014, RusLTC contains a total of nearly 1.2 million word tokens, 258 source texts, and 1,795 translations. The paper gives a brief overview of the related research and describes the corpus structure and the corpus-building technologies used; it also covers the query tool features and our error annotation solutions. In the final part we summarize RusLTC-based research and its current practical applications, and suggest research prospects and possibilities.
This workshop is about major challenges in the overall process of MWE treatment, both from the theoretical and the computational viewpoint, focusing on original research related to the following topics:
- Manually and automatically constructed resources
- Representation of MWEs in dictionaries and ontologies
- MWEs in linguistic theories like HPSG, LFG and minimalism
- MWEs and user interaction
- Multilingual acquisition
- Multilingualism and MWE processing
- Models of first and second language acquisition of MWEs
- Crosslinguistic studies on MWEs
- The role of MWEs in the domain adaptation of parsers
- Integration of MWEs into NLP applications
- Evaluation of MWE treatment techniques
- Lexical, syntactic or semantic aspects of MWEs
The paper focuses on the reaction of Italian literary critics to the publication of Boris Pasternak’s novel “Doctor Zhivago”. An analysis is given of the book “‘Doctor Zhivago’, Pasternak, 1958, Italy” (published in Russian by “Reka vremen”, Moscow, 2012). The papers of Italian writers, critics, and historians of literature who reacted immediately upon the publication of the novel (A. Moravia, I. Calvino, F. Fortini, C. Cassola, C. Salinari, etc.) are studied and analyzed.
The article describes the patterns of realization of emotional utterances in dialogic and monologic speech. Special attention is paid to the characteristic features of the speech of a speaker experiencing psychological tension, and to the compositional-pragmatic peculiarities of dialogic and monologic texts.