Automating the Adaptation of Texts for Electronic Textbooks: Problems and Prospects (the Case of Russian)
This paper describes the experience of using authentic corpus materials within the project "Creating an electronic textbook of Russian as a foreign language". Special attention is paid to the fundamental principle of the new project: automatic adaptation of linguistic material from the RNC. The software under development is intended to adapt the complexity of authentic texts in terms of their syntactic and morphological structure and vocabulary. The article also explains the stages required to reach this goal. It describes not only the algorithm for solving these tasks and the final result of the research, but also the difficulties the developers face and their solutions.
Philological research, especially in the field of literature, is usually considered a "thing-in-itself" whose intrinsic value lies in highly intuitive, creative, "human-readable" analysis. Meanwhile, the modern variety of computer programs (semantic text summarizers, tag clouds, concordancers, etc.), created also for the humanities, such as sociology, psychology, and management, cannot but draw a philologist’s attention. The steps for working with a parallel subcorpus in the Russian National Corpus are described in detail. The freeware LF Aligner (free for non-commercial use) is reviewed and used to compare Russian translations of the novel "All Red" by J. Chmielewska. The lexical items selected as examples are the modal word "avos’", the word "nakonets" as a parenthetical and as an adverbial, and the words "ves’" and "tsely". Three translations of the novel, by M. Krongauz, V. Selivanova, and O. Kuznetsova, were processed with LF Aligner. A consistent description of existing programs, their testing on literary material, and a comparison of the resulting data with traditional research, especially in philology and foreign language teaching, is a new step in text analysis.
Corpus linguistics can be broadly defined in terms of two partially overlapping research dimensions. On the one hand, corpus linguistics is knowledge of how to compile and annotate linguistic corpora. On the other hand, corpus linguistics is a family of qualitative and quantitative methods of language study based on corpus data. The book presents the first steps taken by Russian corpus linguistics toward the development of language corpora and corpus-based resources as well as their use in grammatical and lexical analysis.
The first part of the book focuses on the annotation of Russian texts at several levels: lemmas, part of speech and inflectional forms, word formation, lexical-semantic classes, syntactic dependencies, semantic roles, frames, and lexical constructions. We discuss various theoretical principles and practical considerations motivating the corpus markup design, provide details on the creation of lexical resources (electronic dictionaries and databases) and text processing software, and consider complicated cases that present challenges for the annotation of corpora both manually and automatically. In most cases we describe the annotation of the Russian National Corpus (RNC, ruscorpora.ru) and its affiliate project FrameBank (framebank.ru).
Frequency data depend not only on the representativeness and balance of texts in a corpus, but also on the rules and tools used for annotation. The book addresses the development of evaluation standards for Russian NLP resources, namely, morphological taggers and dependency parsers. In addition, the book presents several experiments on automatic annotation and disambiguation: lemmatization of word forms not in the dictionary; word sense disambiguation based on vectors formed by lexical, semantic and grammatical cues of context; and semantic role labeling.
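The idea of disambiguating word senses with vectors built from lexical, semantic and grammatical context cues can be illustrated with a minimal sketch. It is not the method used in the book: the cue labels (`lex:`, `sem:`, `gr:`), the toy sense profiles for the lemma luk, and the function names are all hypothetical, and cosine similarity is an assumed choice of comparison.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def disambiguate(context_cues, sense_profiles):
    """Return the sense whose cue profile is closest to the observed context."""
    context = Counter(context_cues)
    return max(sense_profiles, key=lambda s: cosine(context, sense_profiles[s]))


# Hypothetical cue profiles for two senses of the lemma "luk" (toy counts).
sense_profiles = {
    "onion": Counter({"sem:food": 3, "lex:rezat'": 2, "gr:acc": 1}),
    "bow": Counter({"sem:weapon": 3, "lex:strela": 2, "gr:ins": 1}),
}

# Cues observed around one occurrence of "luk" (also invented for illustration).
chosen = disambiguate(["lex:rezat'", "sem:food", "gr:acc"], sense_profiles)
print(chosen)
```

Keeping the three cue types in a single sparse vector lets one similarity measure weigh them together; a real system would learn the profiles from annotated corpus samples.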
The final chapters of the first part of the book outline two types of frequency dictionaries based on the RNC data: a general-purpose frequency dictionary and a lexico-grammatical one.
The second part of the book presents an analysis of corpus data and includes a number of case studies of Russian grammar and lexical-grammatical interaction using quantitative methods. The key concept underlying our analysis is the behavioral profile (Hanks 1996; Divjak, Gries 2006), which is the frequency distribution of variable elements in a linguistic unit as attested in a corpus. This covers grammatical profiles (the frequency distribution of inflected forms of a word), constructional profiles (the frequency distribution of argument or any other constructions attested for a key predicate), lexical and semantic profiles (the frequency distribution of words and lexical-semantic classes in construction slots or, more generally, in the context of a word), and radial category profiles (the frequency distribution of word senses and word uses across the radial category network of a polysemous unit). We use grammatical, constructional, semantic, and radial category profiling to study tense, aspect and mood specialization of Russian verb forms; to identify singular-oriented and plural-oriented nouns; to investigate factors for prefix choice and prefix variation in natural perfectives (chistovidovye perfectivy); to analyze constraints on the filling of slots in a construction and how this affects the meaning of the construction, taking as an example the Genitive construction of shape and the spatial construction with the preposition poverkh ‘up and over’.
The quantitative corpus-based techniques used for the analysis range from simple descriptive statistics (e.g., absolute frequencies, percentages, measures of central tendency and outliers) to Fisher's exact test and logistic regression. We claim that vector modeling approaches to quantitative grammatical studies in theoretical linguistics are no less effective than in computational linguistics, where they have become a standard tool.
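As an illustration of one of the techniques named above, here is a minimal sketch of a two-sided Fisher's exact test for a 2x2 contingency table, written with the Python standard library only. The function name and the toy counts are assumptions made for the example; the book does not specify an implementation.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Probability that the top-left cell equals x, given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Small relative tolerance so that ties are not lost to rounding.
    p_value = sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


# Toy counts (invented): rows are two verbs, columns are
# imperative forms vs. all other forms.
p = fisher_exact_two_sided(3, 1, 1, 3)
print(round(p, 4))
```

The exact test is preferred over a chi-squared approximation when some cells hold very small counts, which is common for rare inflected forms. This direct summation is only practical for small tables; large corpus counts would call for a statistics library.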
The paper analyzes the characteristics of academic scientific language as an indispensable component of preparing foreign students to study engineering at university.
A new electronic frequency dictionary shows the distribution of grammatical forms in the inflectional paradigm of Russian nouns, adjectives and verbs, i.e. the grammatical profile of individual lexemes and lexical groups. While the frequency hierarchy of grammatical categories (e.g. the frequency of part-of-speech classes or the average ratio of Nominative to Instrumental case forms) has long been a standard topic of research, the present project shifts the focus to the distribution of grammatical forms in particular lexical units. Of particular concern are words with certain biases in their grammatical profile, e.g. verbs used mostly in the Imperative or in the past neuter, or nouns used often in the plural. The dictionary will be a source for much future research on Russian grammar, paradigm structure, grammatical semantics, and the variation of grammatical forms.
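A grammatical profile in the sense used here can be sketched as the relative frequency of each inflected form per lemma. The snippet below is an illustrative assumption, not the dictionary's actual pipeline: the input format (lemma, tag pairs), the toy data, and the 0.5 bias threshold are all hypothetical.

```python
from collections import Counter, defaultdict


def grammatical_profiles(tokens):
    """Map each lemma to the relative frequencies of its grammatical tags."""
    counts = defaultdict(Counter)
    for lemma, tag in tokens:
        counts[lemma][tag] += 1
    return {
        lemma: {tag: n / sum(tag_counts.values()) for tag, n in tag_counts.items()}
        for lemma, tag_counts in counts.items()
    }


def biased_lemmas(profiles, tag, threshold=0.5):
    """Lemmas whose share of the given tag exceeds the (assumed) threshold."""
    return sorted(
        lemma for lemma, profile in profiles.items()
        if profile.get(tag, 0.0) > threshold
    )


# Toy annotated tokens (invented): (lemma, number tag).
tokens = [
    ("glaz", "pl"), ("glaz", "pl"), ("glaz", "pl"), ("glaz", "sg"),
    ("dom", "sg"), ("dom", "sg"), ("dom", "sg"), ("dom", "pl"),
]

profiles = grammatical_profiles(tokens)
print(profiles["glaz"])
print(biased_lemmas(profiles, "pl"))
```

In practice the threshold would be set relative to the average share of the form across the whole part-of-speech class rather than fixed in advance.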
The resource is based on the data of the Russian National Corpus. The article addresses some general issues such as corpora use in compiling frequency resources and technology of corpus data processing. We suggest certain solutions related to the selection of data and the level of granularity of grammatical profile. Text creation time and language registers are discussed as parameters which may shape the grammatical profile fluctuations.
Our research aims at the automatic identification of constructions associated with particular lexical items and its subsequent use in building a catalogue of Russian lexical constructions. The study is based on data extracted from the Russian National Corpus (RNC, http://ruscorpora.ru). The main emphasis is on extensive use of morphological and lexico-semantic data drawn from the multi-level corpus annotation. Lexical constructions are regarded as the most frequent combinations of a target word and corpus tags which regularly occur within a certain left and/or right context and mark a given meaning of the target word. We focus on nominal constructions with target lexemes that refer to speech acts, emotions, and instruments. The toolkit that processes corpus samples and learns the constructions is described. We provide an analysis of the structure and content of extracted constructions (e.g. r:ord der:num t:ord r:qual|pervyj ‘first’ + LJUBOV’ ‘love’; LJUBOV’ ‘love’ + PR|s ‘from’ + ANUM m sg gen|pervyj ‘first’ + S f inan sg gen|vzgljad ‘sight’ = love at first sight). As regards their structure, constructions may be considered n-grams (n is 2 to 5). The representation of constructions is bipartite, as they may combine either morphological and lemma tags or lexical-semantic and lemma tags. We discuss the use of the visualization module PATTERN.GRAPH, which represents the inner structure of extracted constructions.
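The view of constructions as n-grams (n from 2 to 5) of a target word plus corpus tags can be sketched as follows. This is a simplified illustration and not the toolkit described in the abstract: it uses a single tag layer, and the function name, the `min_count` cutoff, and the toy sentences are assumptions.

```python
from collections import Counter


def extract_constructions(sentences, target, min_n=2, max_n=5, min_count=2):
    """Count n-grams that contain the target lemma.

    Each sentence is a list of (lemma, tag) pairs. In a pattern the target
    is written as its upper-cased lemma and every neighbour as its tag.
    """
    counts = Counter()
    for sentence in sentences:
        for i, (lemma, _tag) in enumerate(sentence):
            if lemma != target:
                continue
            for n in range(min_n, max_n + 1):
                # Slide a window of length n over every position covering i.
                for start in range(i - n + 1, i + 1):
                    end = start + n
                    if start < 0 or end > len(sentence):
                        continue
                    pattern = tuple(
                        lem.upper() if j == i else tag
                        for j, (lem, tag) in enumerate(sentence[start:end], start)
                    )
                    counts[pattern] += 1
    return [(p, c) for p, c in counts.most_common() if c >= min_count]


# Toy tagged sentences (invented for illustration).
sentences = [
    [("pervyj", "ANUM"), ("ljubov'", "S")],
    [("pervyj", "ANUM"), ("ljubov'", "S"), ("byt'", "V")],
]

constructions = extract_constructions(sentences, "ljubov'")
print(constructions)
```

To approximate the bipartite representation mentioned in the abstract, the same function could be run once over morphological tags and once over lexical-semantic tags.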
The paper describes the nationally oriented training complex for learning Russian as a foreign language “V Dobryi Put’!” for German speakers, which includes a textbook, an audio supplement, a video film and a test system. The computer-based training system designed for this complex ensures effective teaching of Russian phonetics, vocabulary, grammar, and speech etiquette to foreign students, and also facilitates the formation of the necessary communication skills.
The paper studies the reaction of Italian literary critics to the publication of Boris Pasternak's novel "Doctor Jivago". It analyzes the book "'Doctor Jivago', Pasternak, 1958, Italy" (published in Russian by "Reka vremen", Moscow, 2012). The papers of Italian writers, critics and historians of literature who reacted immediately to the publication of the novel (A. Moravia, I. Calvino, F. Fortini, C. Cassola, C. Salinari, etc.) are studied and analyzed.
The article describes the patterns by which emotional utterances are realized in dialogic and monologic speech. The author pays special attention to the characteristic features of the speech of a speaker experiencing psychological tension and to the compositional and pragmatic peculiarities of dialogic and monologic text.
I give an explicit formula for the (set-theoretical) system of resultants of m+1 homogeneous polynomials in n+1 variables.