Exploration of register-dependent lexical semantics using word embeddings

A. B. Kutuzov; E. Kuzmenko; A. Marakasova

P. 26–34.

We present an approach to detecting differences in lexical semantics across English language registers, using word embedding models from the distributional semantics paradigm. Models trained on register-specific subcorpora of the BNC are used to compare lists of nearest associates for particular words and to draw conclusions about how their meaning shifts depending on the register in which they are used. The models are evaluated on the task of register classification using the deep inverse regression approach.
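The comparison of nearest associates can be illustrated with a minimal sketch. It assumes each register-specific model is available as a plain mapping from word to vector. The toy vectors, the register names, and the helper names `nearest_associates` and `neighbour_overlap` are hypothetical and not taken from the paper, and the Jaccard overlap is only one possible way to quantify how far a word's neighbourhood differs between registers.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_associates(model, word, topn=2):
    """Return the topn words closest to `word` in one register's model."""
    target = model[word]
    scored = [(other, cosine(target, vec))
              for other, vec in model.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scored[:topn]]


def neighbour_overlap(model_a, model_b, word, topn=2):
    """Jaccard overlap of the two neighbour lists; low values hint at a shift."""
    a = set(nearest_associates(model_a, word, topn))
    b = set(nearest_associates(model_b, word, topn))
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Toy two-dimensional "embeddings" (illustrative values, not real BNC models).
fiction = {
    "cell": [1.0, 0.0], "prison": [0.9, 0.1], "guard": [0.8, 0.2],
    "membrane": [0.0, 1.0], "nucleus": [0.1, 0.9],
}
academic = {
    "cell": [0.0, 1.0], "prison": [0.9, 0.1], "guard": [0.8, 0.2],
    "membrane": [0.05, 0.95], "nucleus": [0.1, 0.9],
}

print(nearest_associates(fiction, "cell"))
print(nearest_associates(academic, "cell"))
print(neighbour_overlap(fiction, academic, "cell"))
```

In this toy setup the neighbours of "cell" are disjoint across the two registers, which is the kind of signal the nearest-associate comparison looks for.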

Additionally, we present a demo web service that features most of the described models and allows users to explore word meanings in different English registers and to detect the register affiliation of arbitrary texts. The code for the service can be easily adapted to any set of underlying models.
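Detecting the register of an arbitrary text can be sketched as classification by inversion: score the text under one language model per register, then apply Bayes' rule to obtain register probabilities. The sketch below is not the paper's implementation. It swaps the word embedding likelihoods for add-one-smoothed unigram models to stay self-contained, and the corpora, register names, and function names are hypothetical.

```python
import math
from collections import Counter


def log_likelihood(counts, vocab_size, tokens, alpha=1.0):
    """Log-probability of `tokens` under a smoothed unigram model."""
    total = sum(counts.values())
    return sum(
        math.log((counts[t] + alpha) / (total + alpha * vocab_size))
        for t in tokens
    )


def classify_register(models, tokens, priors=None):
    """Return P(register | tokens) by inverting per-register likelihoods."""
    vocab = set(tokens)
    for counts in models.values():
        vocab.update(counts)
    if priors is None:  # assume a uniform prior over registers
        priors = {name: 1.0 / len(models) for name in models}
    log_post = {
        name: math.log(priors[name])
        + log_likelihood(counts, len(vocab), tokens)
        for name, counts in models.items()
    }
    # Normalise with log-sum-exp for numerical stability.
    peak = max(log_post.values())
    log_z = peak + math.log(sum(math.exp(v - peak) for v in log_post.values()))
    return {name: math.exp(v - log_z) for name, v in log_post.items()}


# Toy register "corpora" (illustrative only).
models = {
    "fiction": Counter("she smiled and walked into the dark room".split()),
    "academic": Counter(
        "the results show a significant effect of the method".split()
    ),
}

probs = classify_register(models, "the method show significant results".split())
print(max(probs, key=probs.get), probs)
```

Replacing `log_likelihood` with a score from a register-specific embedding model would bring the sketch closer to the approach named in the abstract; the Bayes inversion step stays the same.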

Language: English

Keywords: natural language processing; digital humanities; communicative grammar, text structure, text typology, fiction – non-fiction, register; genre studies; word2vec; word embeddings

In book

Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH)

Osaka: [s.n.], 2016.

Digital Support for Humanities Educational Programs

Kornienko S., Ismakaeva I., Senina A., Отечественная и зарубежная педагогика 2026 Vol. 1 No. 2(113) P. 91–102

In the digital age, digital proficiency is becoming a key literacy of the 21st century, particularly relevant for students in humanities education programs. This article proposes a comprehensive model for integrating digital technologies into humanities education at a university. The methodology relies on case studies and design-based research elements, including analysis of regulatory documents, educational ...

Added: April 30, 2026

Discriminative Lemmatization of Abbreviations in the LLM Era

Глазкова А. В., Смаль И. В., Lyashevskaya O. et al., Доклады Российской академии наук. Математика, информатика, процессы управления (formerly Доклады Академии Наук. Математика) 2025 Vol. 527 P. 146–155

This paper presents a study on the effectiveness of discriminative methods for abbreviation lemmatization in Russian texts. Unlike generative approaches, discriminative models select the optimal lemma from a fixed set of candidates, eliminating the risk of generating grammatically incorrect word forms. For the first time in Russian language processing, we conduct a comprehensive analysis of ...

Added: March 10, 2026

RuCLEVR: A Russian Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning

Biryukova K., Chelnokova D., Erkenova J. et al., Communications in Computer and Information Science 2024 Vol. 2364 CCIS P. 109–121

Added: February 25, 2026

30th International Conference on Applications of Natural Language to Information Systems, NLDB 2025, Kanazawa, Japan, July 4–6, 2025, Proceedings, Part I. Natural Language Processing and Information Systems. (LNCS, volume 15836)

Springer, 2025.

The two-volume set LNCS 15836 and 15837 constitutes the proceedings of the 30th International Conference on Applications of Natural Language to Information Systems, NLDB 2025, held in Kanazawa, Japan, during July 4–6, 2025. The 33 full papers, 19 short papers and 2 demo papers presented in this volume were carefully reviewed and selected from 120 submissions. ...

Added: February 3, 2026

Open Computational Tools for Digitizing and Analyzing Russian-Language Text in Digital Humanities

Orekhov B., Цифровые гуманитарные исследования 2025 No. 2 P. 71–83

The article surveys lesser-known modules that can be used for Digital Humanities tasks involving text analysis and digitization. These include modules that ease the digitization of texts printed in pre-reform orthography (an OCR model and a converter to the modern orthography), an accentuator that marks stress, a direct speech detector, code for assessing the formulaicity of a folklore text, a converter for ...

Added: December 19, 2025

Digital Humanities and Literary Realism

Skorinkin D., Orekhov B., in: The Oxford Handbook of Global Realisms. Oxford: Oxford University Press, 2025. Ch. 10. P. 177–204.

This chapter investigates literary prose of the realist era in Russia using digital humanities methods. It focuses on how computational analysis can enhance an understanding of descriptions of literary characters, geographical locations, and lexical composition in literary texts. Using a corpus of more than five hundred texts (forty-six million word occurrences), it eschews the focus ...

Added: September 14, 2025

Rewriting the Rules: LLMs Vs. Traditional ML in University Admissions

Chepikov I., Karpov I., in: 26th International Conference, AIED 2025, Palermo, Italy, July 22–26, 2025, Proceedings, Part I. Artificial Intelligence in Education. Posters and Late Breaking Results, Workshops and Tutorials, Industry and Innovation Tracks, Practitioners, Doctoral Consortium, Blue Sky, and WideAIED. Springer, 2025. P. 352–358.

Modern LLMs such as BERT, ChatGPT, and DeepSeek have shown great potential in solving various tasks, including text classification, text generation, and document analysis and summarization. In this paper, we show that these models come close to classical ML approaches based on decision trees not only in text processing, but also in processing classical tabular data ...

Added: September 4, 2025

Automatic Summarization of Parent Chats in WhatsApp

Dmitrieva K., Жолус М. Р., Вестник Новосибирского государственного университета. Серия: Лингвистика и межкультурная коммуникация 2025 Vol. 23 No. 1 P. 80–92

Automatic text summarization is one of the main tasks of natural language processing (NLP), which consists in creating a shorter version of the source text. In today’s world the amount of information consumed by people is constantly increasing, therefore more and more emphasis is being placed on the task of summarization. There are two main approaches ...

Added: July 8, 2025

Methods and Tools for Extracting Terms from Texts for Terminological Tasks

Bolshakova E. I., Семак В. В., Программные продукты и системы 2025 Vol. 38 No. 1 P. 5–16

The current state of automatic term extraction from specialized natural language texts, including scientific and technical documents, is considered. Practical applications of methods and tools for extracting terms from texts include the creation of terminological dictionaries, thesauri, and glossaries of problem-oriented domains, as well as extraction of keywords and construction of subject ...

Added: July 2, 2025

High-Level Semantic Interpretation of the Structure of Static Models for Russian

Serikov O., Ganeeva V., Аксенова А. А. et al., Вестник Новосибирского государственного университета. Серия: Лингвистика и межкультурная коммуникация 2023 Vol. 21 No. 1 P. 67–82

Since its inception, the Word2vec vector space has become a universal tool both for scientific and practical activities. Over time, it became clear that there is a lack of new methods for interpreting the location of words in vector spaces. The existing methods included consideration of analogies or clustering of a vector space. In recent ...

Added: April 28, 2025

Automation of Forensic Authorship Attribution: Problems and Prospects

Romanova T. V., Khomenko A., Legal Issues in the Digital Age 2022 Vol. 3 No. 2 P. 90–115

The article deals with validation of an integrative attribution algorithm based on the analysis of the author’s idiostyle using methods of interpretative linguistics with objectification of the available data with the help of mathematical statistics. The algorithm addresses the identification problem of the attribution. The choice of parameters describing the individual style of ...

Added: March 12, 2025

Fundamentals of Digital Philology: Methods and Principles of Computational Text Analysis

Kazartsev (Evgenii Kazartcev) E., Пронин Д. Д., St. Petersburg: "Политехника" Publishing House, 2024.

This textbook is a unique edition containing material for teaching methods of computational text analysis, primarily of literary fiction. It uses databases and corpora hosted on the СОЦИОЛИТ digital platform, which is designed for studying the interaction between literature and society. The methods presented open up the boundaries of traditional philology; they allow quantitative and qualitative analysis of a text's content and vocabulary within the paradigm of modern ...

Added: February 19, 2025

Automatic Morpheme Segmentation for Russian: Can an Algorithm Replace Experts?

Morozov D., Garipov T., Lyashevskaya O. et al., Journal of Language and Education 2024 Vol. 10 No. 4 P. 71–84

Introduction: Numerous algorithms have been proposed for the task of automatic morpheme segmentation of Russian words. Due to the differences in task formulation and datasets utilized, comparing the quality of these algorithms is challenging. It is unclear whether the errors in the models are due to the ineffectiveness of algorithms themselves or to errors and inconsistencies ...

Added: January 7, 2025

Is a Digital History of Philosophy Possible?

Alieva O., Историко-философский ежегодник 2024 Vol. 39 P. 266–304

The article raises the question of the possibility of “digitalization” in the field of historical and philosophical research. We first give a brief overview of the main genres of philosophical historiography and then examine the compatibility of these genres with some instruments of natural language processing. It is argued that methods of distributional semantics ...

Added: December 28, 2024

Digital Humanities Projects: Learning DH by Doing

Gomeniuk N. V., Ismakaeva I., in: Будь в курсе цифровых гуманитарных исследований. Krasnoyarsk: Siberian Federal University, 2024. P. 98–108.

The emergence and development of a field such as Digital Humanities sets universities new tasks in training specialists who not only have deep knowledge of their subject area but also command modern digital tools and methods. An "infrastructural" requirement for training such specialists is developing their project thinking and project work skills. We describe the experience of implementing ...

Added: December 3, 2024

Python for Humanities Scholars, or Why Programming Cannot Be Learned on the First Try

Senina A., in: Будь в курсе цифровых гуманитарных исследований. Krasnoyarsk: Siberian Federal University, 2024. P. 164–181.

The monograph is the outcome of the All-Russian seminar "Digital Humanities at Universities: Programs, Courses, Competencies". It collects teaching experiences that today form the didactic basis of the digital humanities. The materials cover a wide range of topics: the self-definition of digital humanists in the modern university, the architectures of master's programs and minors, syllabi of specialized and online courses, digital competencies, and project practices. It will be of interest to a wide range of humanities teachers: historians, philologists, linguists, philosophers, sociologists, ...

Added: December 3, 2024

How the Digital History of Ideas Is Made

Alieva O., in: Будь в курсе цифровых гуманитарных исследований. Krasnoyarsk: Siberian Federal University, 2024. P. 51–59.

The digital history of ideas is a relatively young field within Digital Humanities that uses the tools of corpus linguistics in combination with the methodology of the Cambridge School and Begriffsgeschichte. Both the theoretical framework and the practical implementations of this approach require reflection, which should show, first, the value and, second, the feasibility of adopting it in the Russian educational and research context. Leaving theoretical questions for another ...

Added: December 3, 2024