Reflections of syntactic structures in non­autoregressive language models

Глазкова А. В., Смаль И. В., Lyashevskaya O. et al., Доклады Российской академии наук. Математика, информатика, процессы управления (ранее - Доклады Академии Наук. Математика) 2025 Т. 527 С. 146–155

This paper presents a study on the effectiveness of discriminative methods for abbreviation lemmatization in Russian texts. Unlike generative approaches, discriminative models select the optimal lemma from a fixed set of candidates, eliminating the risk of generating grammatically incorrect word forms. For the first time in Russian language processing, we conduct a comprehensive analysis of ...

Added: March 10, 2026

30th International Conference on Applications of Natural Language to Information Systems, NLDB 2025, Kanazawa, Japan, July 4–6, 2025, Proceedings, Part I. Natural Language Processing and Information Systems. (LNCS, volume 15836)

Springer, 2025.

The two-volume set LNCS 15836 and 15837 constitutes the proceedings of the 30th International Conference on Applications of Natural Language to Information Systems, NLDB 2025, held in Kanazawa, Japan, during July 4–6, 2025. The 33 full papers, 19 short papers and 2 demo papers presented in this volume were carefully reviewed and selected from 120 submissions. ...

Added: February 3, 2026

MuMMy: Multimodal Dataset supporting VLM-based Egyptology Research Assistant

Golyadkin M., Innokentiy Humonen, Rubanova V. et al., , in: MM '25: Proceedings of the 33rd ACM International Conference on Multimedia.: Association for Computing Machinery (ACM), 2025. P. 12875–12881.

We present the first multimodal dataset MuMMy, for developing research assistants that can interpret Egyptian hieroglyphic texts. It pairs images with Gardiner codes, transliteration, and English translation at two levels of granularity. We also evaluate several deep learning pipelines across OCR, transliteration, and translation tasks, revealing the complexity of the domain and the challenges posed ...

Added: November 8, 2025

Автоматическая саммаризация родительских чатов в WhatsApp

Dmitrieva K., Жолус М. Р., Вестник Новосибирского государственного университета. Серия: Лингвистика и межкультурная коммуникация 2025 Т. 23 № 1 С. 80–92

Automatic text summarization is one of the main tasks of natural language processing (NLP), which consists in creating a shorter version of the source text. In today’s world the amount of information consumed by people is constantly increasing, therefore more and more emphasis is being placed on the task of summarization. There are two main approaches ...

Added: July 8, 2025

Методы и средства извлечения терминов из текстов для терминологических задач

Bolshakova E. I., Семак В. В., Программные продукты и системы 2025 Т. 38 № 1 С. 5–16

The current state in the field of automatic term extraction from specialized natural language texts, including scientific and technical documents, is considered. Practical applications of methods and tools for extracting terms from texts include creation of terminological dictionaries, thesauri, and glossaries of problem oriented domains, as well as extraction of keywords and construction of subject ...

Added: July 2, 2025

Automation of Forensic Authorship Attribution: Problems and Prospects

Romanova T. V., Khomenko A., Legal Issues in the Digital Age 2022 Vol. 3 No. 2 P. 90–115

The article deals with validation of an integrative attribution algorithm based on the analysis of the author’s idiostyle using methods of interpretative linguistics with ob jectification of the available data with the help of mathematical statistics. The algo rithm addresses the identification problem of the attribution. The choice of parameters describing the individual style of ...

Added: March 12, 2025

Automatic Morpheme Segmentation for Russian: Can an Algorithm Replace Experts?

Morozov D., Garipov T., Lyashevskaya O. et al., Journal of Language and Education 2024 Vol. 10 No. 4 P. 71–84

Introduction: Numerous algorithms have been proposed for the task of automatic morpheme segmentation of Russian words. Due to the differences in task formulation and datasets utilized, comparing the quality of these algorithms is challenging. It is unclear whether the errors in the models are due to the ineffectiveness of algorithms themselves or to errors and inconsistencies ...

Added: January 7, 2025

Think about what you’ve learned: анализ тональности для моделирования пользовательского опыта в сфере онлайн-образования

Kirina M., Человек: образ и сущность. Гуманитарные аспекты 2024 № 2(58) С. 176–204

The article focuses on the application of opinion mining techniques to evaluate user experience on the Hyperskill educational platform, using Python, Java, and Kotlin programming projects as the basis of analysis. The study utilizes sentiment analysis and keyword extraction methods to gauge users' attitudes towards the platform, learning process, and topics covered. To achieve this, ...

Added: December 9, 2023

Disambiguation in context in the Russian National Corpus: 20 yeas later

Lyashevskaya O., Afanasev I., Stefan Rebrikov et al., , in: Компьютерная лингвистика и интеллектуальные технологии: По материалам ежегодной международной конференции «Диалог». Вып. 22.Вып. 22.: [б.и.], 2023. P. 307–318.

An updated annotation of the Main, Media, and some other corpora of the Russian National Corpus (RNC) features the part-of-speech and other morphological information, lemmas, dependency structures, and constituency types. Transformer-based architectures are used to resolve the homonymy in context according to a schema based on the manually disambiguated subcorpus of the Main corpus (morphology ...

Added: September 15, 2023

Identifying and Visualizing Trends in Science, Technology, and Innovation Using SciBERT

Lobanova P., Bakhtin P., Sergienko Y., IEEE Transactions on Engineering Management 2024 No. 71 P. 11898–11906

Identification of science, technology, and innovation trends is a critical topic both for the scientific community and for companies that develop technologies, work on science and technology policy or invest in high tech. In this research authors demonstrate a novel approach implemented in iFORA system (developed by National Research University Higher School of Economics) using ...

Added: September 8, 2023

The Use of Khislavichi Lect Morphological Tagging to Determine its Position in the East Slavic Group

Afanasev I., , in: Proceedings of Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023).: Association for Computational Linguistics, 2023. P. 174–186.

The study of low-resourced East Slavic lects is becoming increasingly relevant as they face the prospect of extinction under the pressure of standard Russian while being treated by academia as an inferior part of this lect. The Khislavichi lect, spoken in a settlement on the border of Russia and Belarus, is a perfect example of ...

Added: May 15, 2023

Proceedings of Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)

Association for Computational Linguistics, 2023.

These proceedings include the 23 papers presented at the 10th Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), co-located with the 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL). Both EACL and VarDial were held in Dubrovnik, Croatia, in a hybrid format, allowing participants to attend on-site or ...

Added: May 15, 2023

Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022)

Marseille: European Language Resources Association (ELRA), 2022.

The proceedings are organised on the basis of the 22 Tracks of the Conference on Language Resources and Evaluation (LREC) held in Marseille, France, from 20 to 25 June 2022. Major topics include corpora and annotation (including tools, systems, treebanks), information extraction and information retrieval (including ner, qa, text mining, document classification, text categorisation), applications involving lrs and evaluation (including ...

Added: February 22, 2023

Автоматическая оценка впечатлений обучающихся методами анализа тональности (на материале отзывов на онлайн-курсы на русском и английском)

Kirina M., Тельнина Л. Д., В кн.: Цифровая гуманитаристика и технологии в образовании (DHTE 2022): сб. статей III Всероссийской научно-практической конференции с международным участием. 17—18 ноября 2022 г.: ФГБОУ ВО МГППУ, 2022. С. 355–374.

В статье описывается эксперимент, направленный на сравнение эффективности инструментов анализа тональности для оценки пользовательского опыта на материале публичных отзывов на онлайнкурсы на образовательной платформе Stepik. Рассматриваются результаты автоматического извлечения сентимент-оценок пользователей на соответствующие курсы как на русском, так и на английском языках. Для русскоязычных текстов обсуждается применение словаря эмотивной лексики «КартаСловСент» и предобученной на датасете ...

Added: December 9, 2022

A hybrid lemmatiser for Old Church Slavonic

Afanasev I., / NRU HSE. Series WP BRP "Linguistics". 2021.

The article considers a lemmatiser that is developed specifically for Old Church Slavonic (OCS). The introduction underlines the problem of the lack of lemmatisers that might deal with different datasets of the OCS. The review gives a short description of previous attempts and current trends in lemmatisation. The lemmatiser is hybrid-based and uses the advantages ...

Added: December 28, 2021

Noisy Text Sequences Aggregation as a Summarization Subtask

Pletnev Sergey, , in: Crowd Science Workshop: Trust, Ethics, and Excellence in Crowdsourced Data Management at Scale (CSW 2021).: Copenhagen, Denmark: CEUR Workshop Proceedings, 2021. Ch. 1 P. 15–20.

Most speech-driven systems on the first step convert audio to text through an automatic speech recognition (ASR) model and then pass the text to any downstream natural language processing (NLP) modules. However, these ASR models can lead to system failure or undesirable output when being exposed to natural language perturbation or variation in practice. In ...

Added: December 13, 2021

Crowd Science Workshop: Trust, Ethics, and Excellence in Crowdsourced Data Management at Scale (CSW 2021)

Copenhagen, Denmark: CEUR Workshop Proceedings, 2021.

The second workshop on Crowd Science is organized in conjunction with the 47th International Conference on Very Large Data Bases (VLDB 2021). This workshop is the second in a series of events that has the goal of helping crowdsourcing “transition” from art to science, and tackles the research challenges that we face to make crowdsourcing ...

Added: December 13, 2021

Uncertainty Estimation in Autoregressive Structured Prediction

Andrey Malinin, Gales M., , in: Proceedings of the 9th International Conference on Learning Representations (ICLR 2021). ICLR, 2021.: ICLR, 2021. P. 1–31.

Added: November 1, 2021

?

Reflections of syntactic structures in non­autoregressive language models

In book

Reflections of syntactic structures in nonautoregressive language models