Building a Dictionary-Based Lemmatizer for Old Irish
P. 12–17.
Dereza O.
This paper explores the problem of developing NLP tools for morphologically rich and orthographically inconsistent classical languages. It is a case study of building a lemmatizer for Old Irish using only a dictionary and an unlabeled corpus as sources of data. At the current stage, the lemmatizer achieves an average recall of 76.31% on a corpus of ca. 100,000 tokens and is able to predict lemmas for out-of-vocabulary words.
In book
Vol. 6: Celtic Language Technology Workshop. P.: [s.n.], 2016.
Глазкова А. В., Смаль И. В., Lyashevskaya O. et al., Доклады Российской академии наук. Математика, информатика, процессы управления (formerly Доклады Академии Наук. Математика) 2025 Vol. 527 P. 146–155
This paper presents a study on the effectiveness of discriminative methods for abbreviation lemmatization in Russian texts. Unlike generative approaches, discriminative models select the optimal lemma from a fixed set of candidates, eliminating the risk of generating grammatically incorrect word forms. For the first time in Russian language processing, we conduct a comprehensive analysis of ...
Added: March 10, 2026
Afanasev I., Glazkova A., Lyashevskaya O. et al., in: Proceedings of the 10th Workshop on Slavic Natural Language Processing (Slavic NLP 2025). Association for Computational Linguistics, 2025. P. 157–170.
Pre-trained language models have significantly advanced natural language processing (NLP), particularly in analyzing languages with complex morphological structures. This study addresses lemmatization for Russian, where errors can critically affect the performance of information retrieval, question answering, and other tasks. We present the results of experiments on generative lemmatization using pre-trained language ...
Added: March 10, 2026
Glazkova A., Lyashevskaya O., Morozov D. et al., Journal of Mathematical Sciences 2025 Vol. 546 P. 32–47
This paper addresses the task of lemmatizing abbreviations in the Russian language. Abbreviation lemmatization is particularly challenging, as it involves not only transforming a word into its normal form but also correctly expanding the abbreviation. We explore two approaches to this task, both leveraging large pretrained language models. The first approach is generative, where the ...
Added: March 10, 2026
Biryukova K., Chelnokova D., Erkenova J. et al., Communications in Computer and Information Science 2024 Vol. 2364 CCIS P. 109–121
Added: February 25, 2026
Мурсалимов К. А., Государство, религия, церковь в России и за рубежом 2025 Vol. 43 No. 4 P. 233–295
For the first time in Russian, a complete translation of Cáin Adomnáin (the Law of Adomnán), a remarkable specimen of ancient Irish canon and secular law from the seventh century, is published. It was adopted with the aim of protecting women, children, and clergy from military violence, that is, those categories of the population that, in accordance with ...
Added: February 24, 2026
П.Е. Белова, А.К. Сафарян, in: Научно-практическая конференция с международным участием "Национальные и международные тенденции и перспективы развития судебной экспертизы". Сборник докладов. Н. Новгород: Изд-во ННГУ им. Н.И. Лобачевского, 2024.
This article describes FindImper, a system for automatically finding and extracting inducements from Russian-language texts, based on a search for verb forms and syntactic relations. The algorithm is implemented in Python using libraries for morphological and syntactic analysis together with a set of rules. The tool is intended to streamline the work of an expert linguist and is available for use via a website ...
Added: January 30, 2026
Mylnikova A., Гасимов А. Р., Научно-техническая информация. Серия 2: Информационные процессы и системы 2025 No. 9 P. 33–38
Based on a study of how large language models (LLMs) function and of the specific characteristics of machine discourse processing, the paper demonstrates the use of an experimental method of computational and linguistic analysis for the statistical study and interpretation of the linguistic characteristics of texts. The research materials are the Brown corpus and corpora of texts artificially generated with Claude Sonnet 3.7 and Grok-3. In the processing mechanisms ...
Added: November 19, 2025
Shumen: INCOMA Ltd, 2025.
This paper introduces a rule-based lemmatization and word embedding pipeline for the endangered Bartangi language, part of the Pamiri language group. The system combines a manually constructed lemma dictionary with morphological suffix rules to improve linguistic consistency in low-resource settings. The results demonstrate enhanced lemmatization accuracy and higher-quality embeddings for downstream NLP tasks. The work ...
Added: October 20, 2025
Khomenko A., Kasimova L., Sychugov E. et al., Psychiatria Danubina 2025 Vol. 37 No. Suppl. 1 P. 213–223
Background: Early recognition of autoaggressive tendencies in young people is essential for diagnostic screening and reducing suicidality risks. This can be achieved through psycholinguistic approaches such as corpus analysis and eye-tracking studies. Corpus research helps to develop generalized speech patterns of those at risk of suicide, while oculographic methods examine perceptual cues linked to suicidal ...
Added: October 19, 2025
[s.n.], 2025.
This collection includes 39 papers from the Dialogue 2025 International Conference on Computational Linguistics and Intelligent Technologies, representing a wide range of theoretical and applied research in the fields of natural language description, modeling language processes, and the development of practical computational linguistic technologies.
This publication is intended for specialists in theoretical and applied linguistics and ...
Added: October 19, 2025
Chepikov I., Karpov I., in: 26th International Conference, AIED 2025, Palermo, Italy, July 22–26, 2025, Proceedings, Part I. Artificial Intelligence in Education. Posters and Late Breaking Results, Workshops and Tutorials, Industry and Innovation Tracks, Practitioners, Doctoral Consortium, Blue Sky, and WideAIED. Springer, 2025. P. 352–358.
Modern LLMs such as BERT, ChatGPT, and DeepSeek have shown great potential in solving various tasks, including text classification, text generation, and the analysis and summarization of documents. In this paper, we show that these models come close to classical ML approaches based on decision trees not only in text processing, but also in processing classical tabular data ...
Added: September 4, 2025
Мазитова Л. Л., Panteleeva L., Вестник Самарского университета. История, педагогика, филология 2024 Vol. 30 No. 4 P. 156–164
The article describes the methodology for creating an anthropological corpus of texts united by their connection to the mining profession. The content of the work corresponds to three research tasks: 1) development of a thematic classification, 2) introduction of conventions for highlighting narratives in the text, 3) determination of principles for organizing the corpus according to the themes of ...
Added: January 18, 2025
Morozov D., Garipov T., Lyashevskaya O. et al., Journal of Language and Education 2024 Vol. 10 No. 4 P. 71–84
Introduction: Numerous algorithms have been proposed for the task of automatic morpheme segmentation of Russian words. Due to the differences in task formulation and datasets utilized, comparing the quality of these algorithms is challenging. It is unclear whether the errors in the models are due to the ineffectiveness of algorithms themselves or to errors and inconsistencies ...
Added: January 7, 2025
Kolmogorova A., Куликова Е. Р., Колмогорова П. А., Текст. Книга. Книгоиздание 2025 No. 38 P. 29–54
The article describes the linguistic features of the texts of the Virtual Visit to the State Hermitage Museum, available on its official website. The purpose of the study is to analyze the set of lexical, morphological, syntactic and discursive metrics of the linguistic complexity of these texts in comparison with the same ...
Added: November 8, 2024