Corpus Analysis of Russian Verse
This paper presents four electronic corpora created in 2011 within the framework of the “Corpus Linguistics: the Albanian, Kalmyk, Lezgian, and Ossetic Languages” Program of Fundamental Research of the RAS. It describes the interface and functionality of these corpora, explains the engineering problems that had to be solved in creating them, and discusses the prospects for their development. Particular emphasis is placed on the compilation of dictionaries and the automatic grammatical markup of the corpora.
This paper describes a system for automatic text summarization developed by the «DC – Systems» company in cooperation with the Faculty of Computer Science at HSE. A summary is a concise description of a text in terms of its content and meaning, i.e. from the point of view of its semantics. The purpose of summarization is to shorten the text as much as possible while preserving its main content. In this article a summary is built from syntactically correlated word combinations; possible additional meanings of separate fragments of the text are disregarded. The quality of a summary is evaluated by how closely it matches the source text semantically.
The main problem is split into two parts: evaluating the semantics of the whole text, without subdividing it into parts, and transforming the text to derive a summary.
The architecture of the system and its main algorithm are described. An example of a summary produced by the system is provided, together with an evaluation of its quality. The current version of the system has the following restriction: it does not handle formulas or special symbols.
The volume includes papers presented at the 17th International Workshop on Treebanks and Linguistic Theories (TLT), which brings together developers and users of linguistically annotated natural language corpora. As ‘treebanks’ we consider any pairing of natural language data (spoken or written) with annotations of linguistic structure at various levels of analysis, ranging from e.g. morpho-phonology to discourse. The articles address all aspects of treebank design, development, and use, including reflections on the design of linguistic annotations, methodology studies, resource announcements or updates, annotation or conversion tool development, and reports on treebank usage.
This paper is a first step towards a corpus-based description of the semantics of Russian pronouns in intensional contexts. Having justified the use of corpora in (formal) semantic research, I delineate a particular issue within the topic: whether a given pronoun is interpreted de se or de re in counteridentity contexts.
A counteridentity context is a clause within the scope of a counterfactual (clause or adverbial) that affects the identity of a real individual, e.g. if I were you, were I you, etc. If a pronoun such as I, my or the Russian reflexive possessive svoj is used in such a context, two options are theoretically possible: either it picks out the speaker’s real self (de re), or it refers to the identity assumed by the speaker in the contrary-to-fact situations introduced by the counterfactual (de se).
Using data from the GICR corpus (approx. 20 billion tokens), I show that for the Russian first-person singular pronoun ja and its corresponding possessive moj, de se reference is possible but the de re interpretation is more frequent. The opposite holds for the reflexive sebja, whereas svoj is interpreted de se without exception. Special attention is paid to situations where more than one referential strategy is possible. The paper concludes with a couple of observations relevant for future formal accounts of de se reference.
The Internet plays an important role in the continued functioning of extremist and terrorist groups. It is crucial to study extremist ideology through linguistic analysis, using methods of corpus and computational linguistics to supplement qualitative analysis and make it more objective. However, corpus-based linguistic research into the ideology of extremists remains scarce because access to such texts is limited. The Dark Web Project of the University of Arizona AI Lab, which contains gigabytes of texts from private extremist and terrorist forums, is a valuable source for corpus-based studies of extremist discourse. The aim of this research is a corpus-based study of Russian-language posts by Caucasian extremists on the KavkazChat forum (included in the RF Federal List of Extremist Materials) in which the 2010 Moscow Metro bombings are discussed. The WordSmith Tools software package was used to identify the most frequent words and word clusters, build concordances, find collocates, etc. A comparative corpus analysis of texts by Islamic extremists and texts by ordinary Internet users on the same topic (comments on relevant newsfeeds) allowed us to identify a number of features of Islamic extremist rhetoric.
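The kinds of output named above (word frequencies, word clusters, concordances, collocates) can be illustrated with a minimal Python sketch. This is not the WordSmith Tools interface, which is a desktop application; the function names, the window sizes, and the toy English sample text are all assumptions made for illustration, and no real forum data is used.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return runs of letters (Latin or Cyrillic)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def word_frequencies(tokens):
    """Count how often each word form occurs."""
    return Counter(tokens)


def clusters(tokens, n=2):
    """Count contiguous n-word sequences (word clusters / n-grams)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def concordance(tokens, node, width=3):
    """Return keyword-in-context lines with `width` tokens on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


def collocates(tokens, node, window=2):
    """Count words occurring within `window` tokens of the node word."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            counts.update(tokens[max(0, i - window):i])
            counts.update(tokens[i + 1:i + 1 + window])
    return counts


# Hypothetical toy text, invented for this sketch.
sample = ("The forum post discusses the attack. "
          "Another post on the forum condemns the attack.")
tokens = tokenize(sample)

print(word_frequencies(tokens).most_common(3))
print(clusters(tokens, 2).most_common(2))
for line in concordance(tokens, "attack", width=2):
    print(line)
print(collocates(tokens, "attack", window=2).most_common(3))
```

A comparative analysis of two corpora would run the same functions over each corpus and compare the resulting counts, ideally after normalizing by corpus size.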
The volume is the third issue of a corpus-based grammar of Russian. It deals with parts of speech and, more generally, with formal classes of the lexicon. It comprises descriptive papers on individual parts of speech and minor word classes.
The “Taiga” project combines a corpus and a syntactic parser and is being developed in a new area of corpus linguistics: the material it yields primarily serves the needs of machine learning rather than linguistic search. The authors consider in detail the methodology for constructing the corpus, its balance, the volume and composition of its segments, and the format and quality of its tagging, which meet the current requirements for developing tools to process the Russian language. Within the framework of the project, the creation of a large, open-source syntactic corpus in the Universal Dependencies format is planned.
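For readers unfamiliar with the Universal Dependencies format mentioned above, the following is a minimal sketch of reading its CoNLL-U serialization (ten tab-separated columns per token) in Python. The sample sentence and its annotation were written by hand for this illustration and are not taken from the Taiga corpus; the helper names `parse_conllu` and `dependency_pairs` are hypothetical.

```python
# The ten CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def parse_conllu(text):
    """Parse CoNLL-U text into a list of sentences (lists of token dicts)."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():              # a blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):          # sentence-level comment or metadata
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        token = dict(zip(FIELDS, cols))
        # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in token["id"] or "." in token["id"]:
            continue
        token["id"] = int(token["id"])
        token["head"] = int(token["head"])
        current.append(token)
    if current:
        sentences.append(current)
    return sentences


def dependency_pairs(sentence):
    """Return (head form, relation, dependent form) triples for a sentence."""
    forms = {tok["id"]: tok["form"] for tok in sentence}
    forms[0] = "ROOT"                     # head 0 marks the sentence root
    return [(forms[tok["head"]], tok["deprel"], tok["form"]) for tok in sentence]


# Hand-annotated illustrative sentence (an assumption, not corpus data).
rows = [
    ["1", "Мама", "мама", "NOUN", "_", "Case=Nom|Gender=Fem|Number=Sing", "2", "nsubj", "_", "_"],
    ["2", "мыла", "мыть", "VERB", "_", "Aspect=Imp|Gender=Fem|Number=Sing|Tense=Past", "0", "root", "_", "_"],
    ["3", "раму", "рама", "NOUN", "_", "Case=Acc|Gender=Fem|Number=Sing", "2", "obj", "_", "_"],
    ["4", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"],
]
SAMPLE = "# text = Мама мыла раму.\n" + "\n".join("\t".join(r) for r in rows) + "\n\n"

for head, rel, dep in dependency_pairs(parse_conllu(SAMPLE)[0]):
    print(f"{rel}({head}, {dep})")
```

The flat, line-per-token layout is what makes such corpora convenient as machine-learning training data: each sentence can be read into token records without a dedicated query engine.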
The paper deals with the encoding of “right” and “left” in Katharevousa Greek, which offers data worth exploring on an intentionally archaizing, artificial language of the 19th and 20th centuries. The research is based on the Corpus of Modern Greek and on translations of two Classical Greek texts (“Anabasis” by Xenophon and “The History of the Peloponnesian War” by Thucydides) into Katharevousa. Since Katharevousa is an archaizing language, one might suppose that it would copy the ancient means of marking “right” and “left”. On the other hand, although the language was artificial, it was based on the variety spoken by educated Greeks, so the strategies of the spoken language of that time can also be expected. Rules for marking “right” and “left” are not usually mentioned in grammar books, so this domain gives us an opportunity to analyze speakers’ intuitive choices. According to the available data, the translators used strategies utterly different from those of the ancient writers. Katharevousa prefers dynamic projections and adverbs to static prepositions, as is evident not only from the translations but also from the quantitative distribution of the markers. The archaization of spatial strategies is quite selective and is influenced mostly by the Old and New Testament texts rather than by Classical Antiquity. Moreover, the choice of spatial marker can depend on extralinguistic factors.
The paper studies the reaction of Italian literary critics to the publication of Boris Pasternak's novel "Doctor Jivago". It offers an analysis of the book "'Doctor Jivago', Pasternak, 1958, Italy" (published in Russian by "Reka vremen", Moscow, 2012). The articles of Italian writers, critics and literary historians who reacted immediately to the publication of the novel (A. Moravia, I. Calvino, F. Fortini, C. Cassola, C. Salinari, etc.) are studied and analysed.
The article describes patterns in the realization of emotional utterances in dialogic and monologic speech. The author pays special attention to the characteristic features of the speech of a speaker under psychological tension and to the compositional and pragmatic peculiarities of dialogic and monologic text.