Improving Distributional Semantic Models Using Anaphora Resolution during Linguistic Preprocessing

A. B. Kutuzov

?

Improving Distributional Semantic Models Using Anaphora Resolution during Linguistic Preprocessing

P. 288–300.

Kutuzov A. B., Козлова О. С.

In natural language processing, distributional semantic models are known as an efficient data driven approach to word and text representation, which allows computing meaning directly from large text corpora into word embeddings in a vector space. This paper addresses the role of linguistic preprocessing in enhancing performance of distributional models, and particularly studies pronominal anaphora resolution as a way to exploit more co-occurrence data without directly increasing the size of the training corpus.
We replace three different types of anaphoric pronouns with their antecedents in the training corpus and evaluate the extent to which this affects the performance of the resulting models in lexical similarity tasks. CBOW and SkipGram distributed models trained on Russian National Corpus are in the focus of our research, although the results are potentially applicable to other distributional semantic frameworks and languages as well. The trained models are evaluated against RUSSE '15 and SimLex-999 gold standard data sets. As a result, we find that models trained on corpora with pronominal anaphora resolved perform significantly better than their counterparts trained on baseline corpora.

Language: English

Full text

Text on another site

Keywords: natural language processing автоматическая обработка естественного языка искусственные нейронные сети анафора anaphora resolution distributional semantics семантическая близость semantic similarity of words дистрибутивная семантика word2vec neural embeddings vector space model word2vec word embeddings разрешение анафоры векторные репрезентации лексики

In book

Компьютерная лингвистика и интеллектуальные технологии: По материалам ежегодной международной конференции «Диалог» (Москва,1–4 июля 2016 г.)

Вып. 15. , М.: Изд-во РГГУ, 2016.

Дискриминативная лемматизация сокращений в эпоху LLM

Глазкова А. В., Смаль И. В., Lyashevskaya O. et al., Доклады Российской академии наук. Математика, информатика, процессы управления (ранее - Доклады Академии Наук. Математика) 2025 Т. 527 С. 146–155

This paper presents a study on the effectiveness of discriminative methods for abbreviation lemmatization in Russian texts. Unlike generative approaches, discriminative models select the optimal lemma from a fixed set of candidates, eliminating the risk of generating grammatically incorrect word forms. For the first time in Russian language processing, we conduct a comprehensive analysis of ...

Added: March 10, 2026

RuCLEVR: A Russian Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning

Biryukova K., Chelnokova D., Erkenova J. et al., Communications in Computer and Information Science 2024 Vol. 2364 CCIS P. 109 – 121

Added: February 25, 2026

30th International Conference on Applications of Natural Language to Information Systems, NLDB 2025, Kanazawa, Japan, July 4–6, 2025, Proceedings, Part I. Natural Language Processing and Information Systems. (LNCS, volume 15836)

Springer, 2025.

The two-volume set LNCS 15836 and 15837 constitutes the proceedings of the 30th International Conference on Applications of Natural Language to Information Systems, NLDB 2025, held in Kanazawa, Japan, during July 4–6, 2025. The 33 full papers, 19 short papers and 2 demo papers presented in this volume were carefully reviewed and selected from 120 submissions. ...

Added: February 3, 2026

Rewriting the Rules: LLMs Vs. Traditional ML in University Admissions

Chepikov I., Karpov I., , in: 26th International Conference, AIED 2025, Palermo, Italy, July 22–26, 2025, Proceedings, Part I. Artificial Intelligence in Education. Posters and Late Breaking Results, Workshops and Tutorials, Industry and Innovation Tracks, Practitioners, Doctoral Consortium, Blue Sky, and WideAIED.: Springer, 2025. P. 352 – 358.

Modern LLM models such as BERT, ChatGPT, DeepSeek have shown great potential in solving various tasks, including text classification, text generation, analysis and summary of documents. In this paper, we show that these models close to classical ML approaches based on decision trees not only in text processing, but also in processing classical tabular data ...

Added: September 4, 2025

Автоматическая саммаризация родительских чатов в WhatsApp

Dmitrieva K., Жолус М. Р., Вестник Новосибирского государственного университета. Серия: Лингвистика и межкультурная коммуникация 2025 Т. 23 № 1 С. 80–92

Automatic text summarization is one of the main tasks of natural language processing (NLP), which consists in creating a shorter version of the source text. In today’s world the amount of information consumed by people is constantly increasing, therefore more and more emphasis is being placed on the task of summarization. There are two main approaches ...

Added: July 8, 2025

Методы и средства извлечения терминов из текстов для терминологических задач

Bolshakova E. I., Семак В. В., Программные продукты и системы 2025 Т. 38 № 1 С. 5–16

The current state in the field of automatic term extraction from specialized natural language texts, including scientific and technical documents, is considered. Practical applications of methods and tools for extracting terms from texts include creation of terminological dictionaries, thesauri, and glossaries of problem oriented domains, as well as extraction of keywords and construction of subject ...

Added: July 2, 2025

Экономические и социальные аспекты атомной энергетики в условиях развития технологий искусственного интеллекта

Podchufarov A., Galkina A. N., Ванина С. С. et al., Экономика и управление: проблемы, решения 2025 Т. 5 № 4 С. 61–74

Under modern conditions, the introduction of artificial intelligence technologies is becoming a significant factor in the development of high-tech industries. The article presents the results of a study of the prospects for the use of intelligent analytical systems in nuclear energy. The experience of foreign countries is analyzed and the features of successful projects using ...

Added: June 5, 2025

Eye-tracking study of unambiguous anaphora resolution in Russian

Tuzhik O., Khanova A., Kudryavtsev S. et al., , in: 12th Novi Sad workshop on Psycholinguistic, neurolinguistic and clinical linguistic research.: Novi Sad: Faculty of Philosophy, University of Novi Sad, 2025. P. 26–28.

Added: April 28, 2025

Высокоуровневая семантическая интерпретация структуры статических моделей для русского языка

Serikov O., Ganeeva V., Аксенова А. А. et al., Вестник Новосибирского государственного университета. Серия: Лингвистика и межкультурная коммуникация 2023 Т. 21 № 1 С. 67–82

Since its inception, the Word2vec vector space has become a universal tool both for scientific and practical activities. Over time, it became clear that there is a lack of new methods for interpreting the location of words in vector spaces. The existing methods included consideration of analogies or clustering of a vector space. In recent ...

Added: April 28, 2025

Разрешение синтаксической местоименной анафоры в системе ЭТАП-3

Иншакова Е. С., В кн.: Сборник статей конференции "Информационные технологии и системы" (ИТиС'16).: М.: ИППИ РАН, 2016.

В данной статье идет речь о правилах установления антецедентов для двух классов анафорических выражений: возвратных местоимений и местоимений 3 лица. ...

Added: April 10, 2025

An anaphora resolution system for Russian based on ETAP-4 linguistic processor

Inshakova E.S., , in: Computational Linguistics and Intellectual Technologies Papers from the Annual International Conference “Dialogue” (2019)Issue 18.: M.: Russian State University for the Humanitie, 2019.

Added: April 10, 2025

Automation of Forensic Authorship Attribution: Problems and Prospects

Romanova T. V., Khomenko A., Legal Issues in the Digital Age 2022 Vol. 3 No. 2 P. 90–115

The article deals with validation of an integrative attribution algorithm based on the analysis of the author’s idiostyle using methods of interpretative linguistics with ob jectification of the available data with the help of mathematical statistics. The algo rithm addresses the identification problem of the attribution. The choice of parameters describing the individual style of ...

Added: March 12, 2025

Automatic Morpheme Segmentation for Russian: Can an Algorithm Replace Experts?

Morozov D., Garipov T., Lyashevskaya O. et al., Journal of Language and Education 2024 Vol. 10 No. 4 P. 71–84

Introduction: Numerous algorithms have been proposed for the task of automatic morpheme segmentation of Russian words. Due to the differences in task formulation and datasets utilized, comparing the quality of these algorithms is challenging. It is unclear whether the errors in the models are due to the ineffectiveness of algorithms themselves or to errors and inconsistencies ...

Added: January 7, 2025

Fear and Loathing in Russian Literature: A Case of Emotion Annotation of Short Stories of the 20th Century

Anna Moskvina, Margarita Kirina, , in: 27th International Conference, IMS 2024, St. Petersburg, Russia, June 24–26, 2024, Selected Papers. Internet and Modern Society. Human-Computer Communication. CCIS, volume 2534Vol. 2534.: Springer, 2025. P. 113–129.

The paper presents an investigation of the emotional aspect of the Russian short story of the 20th century. Our study is two-fold: firstly, we delve into emotional representation at the lexical level, building upon previous work on utilizing vector models to quantify emotional content. In this study, we introduce an annotated corpus where words are ...

Added: November 29, 2024

Automatic detection of grammatical aspect of Russian verbs based on their morphological properties

Petrunina U., Filip H., , in: Proceedings of the Fourth International Workshop on Resources and Tools for Derivational Morphology.: Dubrovnik: Croatian Language Technology Society, 2023.

Added: October 2, 2024

Cross-country analysis of science, technology and innovation policies: non-covid-19 related and Covid-19 specific STI policies in OECD countries

Russo M., Pavone P., Meissner D. et al., Quality and Quantity 2025 Vol. 59 No. Suppl 1 P. S343–S367

In OECD countries, Science, Technology and Innovation (STI) policies were seen as key aspects of coping with the Covid-19 pandemic. Now that the pandemic is over, identifying which policy mix portfolios characterised countries in terms of their non-Covid-19 related and Covid-19 specific STI policies fills a knowledge gap on changes in STI policies induced by ...

Added: September 27, 2024

You shall know a piece by the company it keeps. Chess plays as a data for word2vec models

Orekhov B., / Series Computer Science "arxiv.org". 2024.

In this paper, I apply linguistic methods of analysis to non-linguistic data, chess plays, metaphorically equating one with the other and seeking analogies. Chess game notations are also a kind of text, and one can consider the records of moves or positions of pieces as words and statements in a certain language. In this article ...

Added: August 8, 2024