Evaluation of collocation extraction methods for the Russian language

Pivovarova L., Kormacheva D., Kopotev M.

P. 137–157.

This paper focuses on empirical collocations, understood here as word co-occurrences that 1) are frequent enough to be extracted automatically and 2) may be semantically and/or syntactically bound to various extents. Our main goal is to examine closely five window-based methods for empirical collocation extraction that are widely used in corpus-based studies, sometimes without proven effectiveness. Our study evaluates the methods’ reliability for Russian data by testing two hypotheses: a) collocations listed in a professionally compiled dictionary (i.e., those considered fixed to some extent by experts in the field) should have higher rankings in automatically extracted lists of collocations, and b) collocations considered fixed expressions by native speakers should have higher rankings in automatically generated lists. Our research indicates that raw frequency, t-score, log-likelihood, and Dice give the best rankings, while MI and wFR demonstrate poorer results in both evaluations. In general, both evaluations, although each has its own limitations, lead to comparable results, which should be taken into account in future research.
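As an illustration of the measures named in the abstract, the following is a minimal sketch using the common textbook definitions of t-score, MI, Dice, and log-likelihood. It is not the authors' implementation: the function name, its parameters, and the example counts are hypothetical, and wFR is left out because the abstract does not define it.

```python
import math

def association_scores(f12, f1, f2, n):
    """Score one word pair from raw counts (all inputs are hypothetical).

    f12: co-occurrence count of the pair (must be > 0)
    f1, f2: corpus frequencies of the first and second word
    n: corpus size in tokens
    """
    expected = f1 * f2 / n  # co-occurrences expected under independence
    t_score = (f12 - expected) / math.sqrt(f12)
    mi = math.log2(f12 / expected)
    dice = 2 * f12 / (f1 + f2)

    # 2x2 contingency table, cells ordered (w1,w2), (w1,not w2),
    # (not w1,w2), (not w1,not w2), compared with the expected cells.
    observed = [f12, f1 - f12, f2 - f12, n - f1 - f2 + f12]
    rows = [f1, n - f1]
    cols = [f2, n - f2]
    expected_cells = [rows[i] * cols[j] / n for i in range(2) for j in range(2)]
    log_likelihood = 2 * sum(
        o * math.log(o / e) for o, e in zip(observed, expected_cells) if o > 0
    )
    return {
        "frequency": f12,
        "t_score": t_score,
        "mi": mi,
        "dice": dice,
        "log_likelihood": log_likelihood,
    }

# Hypothetical counts: the pair occurs 10 times, the words 20 and 50 times,
# in a corpus of 1000 tokens.
scores = association_scores(f12=10, f1=20, f2=50, n=1000)
```

Sorting candidate pairs by any one of these values yields a ranked list of the kind that the two hypotheses compare against dictionary entries and native-speaker judgments.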

Language: English

Keywords: collocations, quantitative methods, collocation extraction method, evaluation, frequency, t-score, log-likelihood, Dice, MI, wFR

In book

Quantitative approaches to the Russian language

Abingdon: Routledge, 2018.

Target causal effects in social research

Sokolov B., Социология: методология, методы, математическое моделирование 2025 No. 61 P. 7–76

This article reviews a set of estimands commonly used in modern applied research to operationalize causal inquiries within the Rubin Causal Model (RCM). I first introduce the basic average treatment effects (ATE, ATT, ATC) and then describe their main extensions, including local and conditional treatment effects, causal interactions, causal mediation, multivalued or continuous treatments, and ...

Added: December 19, 2025

P. Bourdieu’s “social space”: the history of the concept’s construction

Shmatko N., Маркова Ю. В., Социологический журнал 2025 Vol. 31 No. 1 P. 110–123

The article deals with the history and interpretation of Pierre Bourdieu’s concept of “social space”. With the help of the concept, Bourdieu described a set of interrelated social phenomena that support and reflect each other. He defined social space as a multidimensional distribution of agents (individual or collective) over objective positions determined by the distribution of effective resources ...

Added: May 23, 2025

The media concept “vaccination” in German media discourse during the COVID-19 pandemic

Balakina Y. V., Вестник Томского государственного университета 2024 No. 509 P. 23–34

The relevance of the research is justified by the influence of the media on people’s consciousness and behavior during the crisis, which makes it possible to form discursive phenomena with specific characteristics. In addition, it seems particularly relevant to use linguistic tools to describe media and political phenomena, as well as to apply media and ...

Added: December 12, 2024

Запутывать мозги (‘to muddle someone’s brains’) and ездить на шее (‘to ride on someone’s neck’): a corpus study of how phraseologized collocations function in everyday spoken communication

Попова Т. И., Драчева К. И., In: Дискурсивные практики в цифровую эпоху: традиции и инновации. Н. Новгород: Изд-во ННГУ им. Н.И. Лобачевского, 2024. P. 208–217.

The article describes fixed multiword units (УНЕ) of spoken colloquial Russian. The observations and conclusions are based on an analysis of material from two corpora: the subcorpus of everyday Russian communication “One Day of Speech” (ORD), totaling 300 thousand tokens (195 episodes), the Spoken Corpus of the Russian National Corpus (360 tokens), and the “Social Networks” corpus (2615 tokens). The study examines in more detail phraseologized collocations ...

Added: October 29, 2024

Empirical challenges and methodological approaches in comparative political science (through the lens of the “Political Atlas of the Modern World 2.0”)

Melville A. Y., Мальгин А. В., Mironyuk M. et al., Полис. Политические исследования 2023 No. 5 P. 153–171

In recent decades, the expanding volume, diversity and coverage of data have created new or have transformed existing areas of research. They have also turned data into a key element of politics today. In this context, the status of empirical research that became the political science mainstream at the turn of the 20th - 21st ...

Added: September 29, 2023

The semantic content of the concept “populism” in English (a lexicographic and corpus analysis)

Gritsenko E., Галочкин А. Е., Вопросы лексикографии 2023 No. 27 P. 29–46

The aim of the article is to reveal the semantic content of the concept “populism” in modern English. The need to address this topic is driven by the fact that a significant part of the research is dedicated to the analysis of specific forms of populism or populist parties in the aspect of political science, discourse theory, political rhetoric, ...

Added: May 6, 2023

Pleonastic participles in modern Russian speech: functions and development trends

Ю. М. Кувшинская, Н. А. Зевахина, Acta Linguistica Petropolitana. Труды института лингвистических исследований 2023 Vol. 19 No. 1 P. 138–192

The paper studies tendencies in the use of full single (i.e. without their arguments) redundant participles in the attributive position in the Russian written discourse. Relying upon the data of the Russian National Corpus and the Corpus of Russian Student Texts, as well as a number of the examples collected from various written sources, the ...

Added: December 8, 2022

Quantifying cross-network effects for non-transactional platforms

Рожкина В. С., Golovanova S., Korneeva D., Вестник Московского университета. Серия 6: Экономика 2022 No. 4 P. 17–38

The analysis of cross-network effects is important because they cannot be observed directly and because they influence the values of all tests in competition policy, pricing practice and merger valuation. The article summarizes the experience of quantifying cross-network effects for non-transactional platforms. This paper systematizes methods for assessing cross-network effects ...

Added: September 15, 2022

Discourses in the propaganda materials of “Red” and “White” periodicals of Perm Governorate during the Civil War

Ехлакова А. Р., Ismakaeva I., In: Пятая зимняя школа по гуманитарной информатике. Калининград: Балтийский федеральный университет им. Иммануила Канта, 2021. P. 20–26.

The article analyzes the most frequent word forms in the propaganda materials published in “Red” and “White” periodicals of Perm Governorate during the Civil War. Applying the discourse theory of E. Laclau and C. Mouffe made it possible to treat the “Red” and “White” periodicals as a field of struggle between the respective discourses over the formation of meanings and the understanding of the world. Using the tools of the AntConc program (N-gram, Collocates), the study identifies the most frequently ...

Added: February 17, 2022

Burrows’s Delta for ancient Greek authors: an application

Alieva O., Schole. Философское антиковедение и классическая традиция 2022 Vol. 16 No. 2 P. 693–705

This paper tests the effectiveness of Burrows’s Delta method on a corpus of selected prose writings in ancient Greek. When tested on a corpus of fourteen and eight authors, the method yields good results with relatively small samples (1000, 3000, and 5000 words) and different word frequency vectors (100, 200, 500 words), but its performance ...

Added: February 9, 2022

Cognitive processing of Russian binomials by Turkic-Russian bilinguals

Буб А. С., Artemenko E., Язык и культура 2019 No. 48 P. 32–45

The article concerns one of the aspects of bilingualism, namely the study of cognitive processing of lexical units in bilinguals. As a review of the scientific literature shows, the bilingual mental lexicon differs from the monolingual mental lexicon. In the latter, words do not exist separately, but together with collocational links, i.e. in conjunction with ...

Added: October 29, 2021

On the contemporaneity of “The current state of the study of politics”: a round table

Gaman-Golutvina O. V., Панов П. В., Filippov A. F., Полития: Анализ. Хроника. Прогноз 2021 No. 1(100) P. 193–209

Added: April 12, 2021

Methods of comparative research

Gaman-Golutvina O. V., In: Политическая компаративистика. М.: Аспект Пресс, 2020. P. 85–104.

Added: April 12, 2021

The balance of power among the great powers in the G20: an analysis using multidimensional scaling

Артюшкин В. Ф., Kazantsev A., Сергеев В. М., Полис. Политические исследования 2021 Vol. 2 P. 125–138

This article applies the method of multidimensional scaling (visualization of multi-dimensional structures) to the study of different dimensions of power competition between the great powers. On the basis of an analysis of the Neo-Realist, Neo-Liberal, and World-systems theory literature on global hegemony, 8 criteria of global leadership were defined: GDP per capita (PPP), military expenditure (% of ...

Added: February 8, 2021

Collocations and near-native competence: Lexical strategies of heritage speakers of Russian

Kopotev M., Polinsky M., Kisselev O., International Journal of Bilingualism 2020 P. 1–28

This paper presents an exploratory study on the use of frequency-based probabilistic word combinations in Heritage Russian. The data used in the study are drawn from three small corpora of narratives, representing the language of Russian heritage speakers from three different dominant-language backgrounds, namely German, Finnish, and American English. The elicited narratives are based on ...

Added: September 30, 2020

On the feeling of respect in the Russian linguistic consciousness: worthy of respect…

Botchkarev A., Slavica Slovaca 2020 Vol. 55 No. 1 P. 46–52

The article explores the ways of displaying uvazheniye ‘respect’ in the Russian linguistic consciousness. The Russian National Corpus is more appropriate for this purpose, because the conceptual configuration of an analyzed concept is not present in a “finished” form in any single utterance, but may be reconstructed from the totality of all possible utterances. According ...

Added: June 24, 2020

Journals of zemstvo assemblies: organizing information using information systems (the case of Perm Governorate)

Kornienko S., Ехлакова А. Р., In: Сборники Президентской библиотеки. Вып. 8: Цифровые проекты в современной информационной среде: наука и практика. СПб.: Президентская библиотека имени Б.Н. Ельцина, 2018. P. 70–83.

The article analyzes the possibilities of using information systems and quantitative methods to study the journals of zemstvo assemblies as a historical source. It characterizes the journals of the assemblies as one of the main record-keeping sources of zemstvo institutions and describes the information systems created at the Center for Digital Humanities of Perm State National Research University. Using these information systems, the article analyzes how information is organized in the journals of zemstvo assemblies and obtains quantitative ...

Added: October 20, 2019

LESS IS DOWN: a corpus analysis of the structure of the metaphorical meaning of the verbs падать and упасть

Kultepina O., Acta Linguistica Petropolitana. Труды института лингвистических исследований 2020 Vol. 1 No. XVI P. 344–367

The paper addresses the possibilities that a corpus-based approach offers for analyzing metaphorical transfer, using the aspectual pair upast’ / padat’ (‘to fall’). The author reviews the structure of the metaphorical meaning of predicates that instantiate Lakoff’s metaphor ‘LESS IS DOWN’ and also analyses how collocations correlate with valency structure. ...

Added: October 7, 2019

The meter of segments longer than a line in Bashkir syllabic verse

Orekhov B., Известия РАН. Серия литературы и языка 2019 Vol. 78 No. 2 P. 41–50

The paper considers a specific element of syllabic versification on the Bashkir text data. We examine the ordered alternations of lines of different lengths. Such verse forms exist in Turkic verse along with the usual isosyllabic poems. The status of such forms is ambiguous; they can be viewed both as a stanza and as a ...

Added: September 18, 2019

Specific words and expressions of 19th-century Russian classic writers: a contrastive corpus study

Orekhov B., Ученые записки Петрозаводского государственного университета. Серия: Общественные и гуманитарные науки 2019 No. 5 P. 70–75

The paper presents the results of a quantitative study that identifies characteristic and specific low-frequency words for the prose of Russian classic writers of the XIX century. TF-IDF measure and a large collection of the XIX century texts by Turgenev, Goncharov, Leskov and Dostoevsky were used to identify words and phrases that are rarely found ...

Added: September 18, 2019