Proceedings of COLING 2012: Posters
This paper is devoted to the use of two tools for creating morphologically annotated linguistic corpora: UniParser and the EANC platform. The EANC platform is the database and search framework originally developed for the Eastern Armenian National Corpus (www.eanc.net) and later adopted for other languages. UniParser is an automated morphological analysis tool developed specifically for creating corpora of languages with relatively small numbers of native speakers for which the development of parsers from scratch is not feasible. It has been designed for use with the EANC platform and generates XML output in the EANC format.
UniParser and the EANC platform have already been used for the creation of the corpora of several languages: Albanian, Kalmyk, Lezgian, Ossetic, of which the Ossetic corpus is the largest (5 million tokens, 10 million planned for 2013), and are currently being employed in construction of the corpora of Buryat and Modern Greek languages. This paper will describe the general architecture of the EANC platform and UniParser, providing the Ossetic corpus as an example of the advantages and disadvantages of the described approach.
The paper reports on the recent forum RU-EVAL - a new initiative for evaluation of Russian NLP resources, methods and toolkits. It started in 2010 with evaluation of morphological parsers, and the second event RU-EVAL 2012 (2011-2012) focused on syntactic parsing. Eight participating IT companies and academic institutions submitted their results for corpus parsing. We discuss the results of this evaluation and describe the so-called “soft” evaluation principles that allowed us to compare output dependency trees, which varied greatly depending on theoretical approaches, parsing methods, tag sets, and dependency orientations principles, adopted by the participants.