Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference "Dialogue" (Bekasovo, May 30 to June 3, 2012). In 2 volumes
The paper discusses sociolinguistic applications of statistical analysis of the spoken subcorpus of the Russian National Corpus. Given the considerable size of the corpus (about 10 million tokens), analysing how various linguistic parameters co-vary with one of the few sociolinguistic parameters available, the speaker’s gender, may yield rich and interesting results. One specific example of co-variation is considered in detail: the mean length of the utterance (in tokens). In public communication, this parameter shows a statistically significant difference between the speech of men and women (men talk more), while no such difference is found in private communication. Another important parameter is the gender of the addressee. Again, co-variation is quite different in public and private discourse. In private communication, utterances are longer when addressing someone of the same sex, and the difference between men and women is not statistically significant. In public communication, utterances are longer when addressing a woman, whether the speaker is a man or a woman. These conclusions are consistent with the results of sociolinguistic gender studies obtained elsewhere and by other methods. Linguistic differences between men and women are not absolute but depend on the communicative situation (public vs. private). Public discourse is a playground for linguistic competition in which men are the winning party. In private discourse, competition dissolves.
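The comparison of mean utterance length between two groups of speakers can be illustrated with a minimal Python sketch. The abstract does not name the statistical test that was used, so the permutation test below is an assumption chosen for illustration only; the function names, the labels "m" and "f", and the toy utterances are all hypothetical and are not data from the Russian National Corpus.

```python
import random
from statistics import mean


def mean_length_by_group(utterances):
    """Return the mean utterance length (in tokens) for each group label.

    `utterances` is a list of (group_label, token_list) pairs.
    """
    lengths = {}
    for label, tokens in utterances:
        lengths.setdefault(label, []).append(len(tokens))
    return {label: mean(values) for label, values in lengths.items()}


def permutation_test(a, b, n_iter=10000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns an estimated p-value: the share of random relabelings whose
    absolute difference in means is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_iter + 1)


if __name__ == "__main__":
    # Invented toy utterances, for illustration only.
    toy = [
        ("m", "well I think that we should go".split()),
        ("m", "yes".split()),
        ("f", "no not really".split()),
        ("f", "maybe later today".split()),
    ]
    print(mean_length_by_group(toy))
    men = [len(t) for g, t in toy if g == "m"]
    women = [len(t) for g, t in toy if g == "f"]
    print(permutation_test(men, women, n_iter=2000))
```

In a real study the same comparison would be run separately for public and private communication, and again with the addressee's gender as the grouping variable, since the abstract reports that the pattern differs across these conditions.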
The paper is devoted to the study of the Russian adjectives meaning ‘heavy’ and ‘light’. It discusses the unexpected symmetry of these lexemes: on the one hand, they are antonyms in practically all of their meanings (internal symmetry); on the other hand, this semantic area has the same structure in the languages that served as the typological background for our research: Serbian, French, English and Chinese (external symmetry). Yet closer examination shows that the similarity of the lexemes is only superficial. The following essential differences are revealed: 1. the adjective ‘heavy’ is used in its direct meaning considerably more frequently than ‘light’; 2. the adjective ‘light’ is used more frequently in metaphorical contexts, which is particularly apparent in the expression of degree: the downtoner meaning of ‘light’ is better developed than the intensifier meaning of ‘heavy’; 3. with certain nouns, ‘heavy’ can involve the meaning component ‘slow’, while ‘light’ can involve either the antonymous component ‘fast’ or, through its downtoner meaning, the component ‘slow’; 4. an analogous phenomenon is observed in the evaluative component of adjectives meaning ‘light’: on the whole, a ‘light’ situation is evaluated positively, but there are contexts in which such an adjective has negative connotations. The adjective meaning ‘heavy’ can only have negative connotations in Russian, but it can also develop positive connotations in other languages (e.g. ‘important’ in Chinese) if its original meaning is not ‘difficult to lift or move’ but ‘(objectively) weighing a lot’.
The paper presents a project aimed at developing a Russian Learner Parallel Corpus, discusses existing analogues, and describes the corpus's current status and the tasks for which it could be used. Existing parallel corpora contain (comparatively) “correct” translations, whereas the aim of the present project is to create a sufficiently large corpus of imperfectly translated Russian and English texts together with their sources and to use it as a tool for translation studies, especially those concerned with translation mistakes. The new corpus will be a valuable resource for computational linguistics, as it provides another way of obtaining evaluation data that could be used to improve machine translation systems. The corpus is currently available online; it already contains nearly half a million word tokens and is growing. The main source of material is translations made by student translators at Russian universities.
The paper presents experimental results on automatic construction identification performed on the Russian National Corpus (RNC). For this purpose we developed a toolbox that extracts and processes co-occurrence data from RNC samples. Russian nouns were chosen as target words, and a list of constructions was built for each target word. By constructions we mean frequent word combinations that include a target word and frequent lexical-semantic tags (context markers of certain meanings of the target word), as well as frequent lemmas representing those tags. E.g.: ВИД (kind, sort, type) + r:abstr t:sport: спорт (sport), футбол (football), биатлон (biathlon), etc. The extracted constructions are grouped according to their structure and lexical-semantic content. Finally, we verify the experimental results by comparing the lists of constructions with lists of collocations, idioms, etc. registered in various linguistic resources (bigram search engines, dictionaries).
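The idea of pairing a target word with frequent lexical-semantic tags and the lemmas that represent them can be sketched in Python as follows. This is not the authors' toolbox: the function name `extract_constructions`, the input format (sentences as lists of lemma and tag pairs), the context window, and the frequency threshold are all assumptions made for illustration, and the toy sentences are invented around the ВИД example from the abstract.

```python
from collections import Counter, defaultdict


def extract_constructions(sentences, target, window=1, min_freq=2):
    """Collect frequent context tags for `target` and their lemmas.

    `sentences` is a list of sentences; each sentence is a list of
    (lemma, tags) pairs, where `tags` is a tuple of tag strings.
    Returns {tag: [(lemma, count), ...]} for every tag seen at least
    `min_freq` times within `window` positions of the target lemma.
    """
    tag_counts = Counter()
    lemmas_by_tag = defaultdict(Counter)
    for sent in sentences:
        for i, (lemma, _tags) in enumerate(sent):
            if lemma != target:
                continue
            start = max(0, i - window)
            end = min(len(sent), i + window + 1)
            for j in range(start, end):
                if j == i:
                    continue  # skip the target word itself
                ctx_lemma, ctx_tags = sent[j]
                for tag in ctx_tags:
                    tag_counts[tag] += 1
                    lemmas_by_tag[tag][ctx_lemma] += 1
    return {
        tag: lemmas_by_tag[tag].most_common()
        for tag, freq in tag_counts.items()
        if freq >= min_freq
    }


if __name__ == "__main__":
    # Invented toy input; the tag string mirrors the abstract's example.
    toy = [
        [("вид", ()), ("спорт", ("r:abstr t:sport",))],
        [("вид", ()), ("футбол", ("r:abstr t:sport",))],
        [("вид", ()), ("биатлон", ("r:abstr t:sport",))],
        [("вид", ()), ("окно", ("r:concr",))],
    ]
    for tag, lemmas in extract_constructions(toy, "вид").items():
        print("ВИД +", tag, lemmas)
```

The threshold `min_freq` stands in for whatever frequency criterion the actual experiments applied; the resulting lists would then be compared against collocations and idioms recorded in external resources, as the abstract describes.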