Methods and tools for extracting terms from texts for terminological tasks

Программные продукты и системы. 2025. Vol. 38. No. 1. P. 5–16.

This paper reviews the current state of automatic term extraction from specialized natural-language texts, including scientific and technical documents. Practical applications of term extraction methods and tools include the creation of terminological dictionaries, thesauri, and glossaries for problem-oriented domains, as well as keyword extraction and the construction of subject indexes for highly specialized documents.

The paper provides an overview of approaches to the automatic recognition and extraction of terminological words and phrases, covering traditional statistical methods as well as machine learning methods, including learning from term features and learning with modern transformer-based neural language models. The approaches are compared, including quality assessments for term recognition and term extraction, and the best-known software tools for automating term extraction within the statistical and feature-based learning approaches are identified.

The authors' studies on term recognition based on neural language models are described, as applied to Russian scientific texts on mathematics and programming. The data set with terminological annotations created for training term recognition models, which covers data from seven related domains, is briefly characterized. The models were built on the pre-trained neural network model BERT and fine-tuned in two ways: as a binary classifier of candidate terms (previously extracted from texts) and as a classifier for sequence labeling of terminological words in texts. The term recognition quality of the developed models is evaluated experimentally and compared with a statistical method. The binary classification models demonstrate the best quality, significantly surpassing the other approaches considered. The experiments also show that the trained models are applicable to texts in a related scientific field.
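The two fine-tuning setups named in the abstract differ mainly in how annotated text is turned into training examples. The sketch below illustrates that difference in plain Python and is not the authors' pipeline: the BIO tag names, the use of all n-grams up to three tokens as candidates, and the function names `spans_to_bio` and `candidates_with_labels` are assumptions made for illustration only.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive


def spans_to_bio(tokens: List[str], term_spans: List[Span]) -> List[str]:
    """Sequence-labeling view: one tag per token (assumed BIO scheme)."""
    tags = ["O"] * len(tokens)
    for start, end in term_spans:
        tags[start] = "B-TERM"
        for i in range(start + 1, end):
            tags[i] = "I-TERM"
    return tags


def candidates_with_labels(
    tokens: List[str], term_spans: List[Span], max_len: int = 3
) -> List[Tuple[str, int]]:
    """Binary-classification view: each candidate phrase gets label 1 or 0.

    Assumption: candidates are all n-grams of up to max_len tokens; the
    abstract does not say how candidates were actually extracted.
    """
    gold = set(term_spans)
    pairs = []
    for n in range(1, max_len + 1):
        for start in range(len(tokens) - n + 1):
            span = (start, start + n)
            phrase = " ".join(tokens[start:start + n])
            pairs.append((phrase, int(span in gold)))
    return pairs


# Hypothetical annotated sentence with one term: "binary search tree".
tokens = ["the", "binary", "search", "tree", "stores", "keys"]
term_spans = [(1, 4)]

bio_tags = spans_to_bio(tokens, term_spans)
pairs = candidates_with_labels(tokens, term_spans)
```

In the first view a model predicts a tag for every token; in the second it scores each candidate phrase independently, so the quality of candidate extraction bounds what the classifier can recover.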

Language: Russian

FinTech and the green transition: Exploring pathways to ignite innovation for carbon neutrality in global supply chains

Yalcin H., Demirhan D., Aracioglu B. et al., Technology in Society 2026 Vol. 84 Article 103094

This article comprehensively evaluates the critical role of FinTech in promoting carbon neutrality and green logistics practices in global supply chains. In our study, using bibliometric analysis, social network analysis and natural language processing (NLP) methods, we evaluate the potential of FinTech innovations to increase traceability, transparency and efficiency in supply chain processes. In this ...

Added: March 11, 2026

Discriminative lemmatization of abbreviations in the era of LLMs

Глазкова А. В., Смаль И. В., Lyashevskaya O. et al., Доклады Российской академии наук. Математика, информатика, процессы управления (formerly Доклады Академии Наук. Математика) 2025 Vol. 527 P. 146–155

This paper presents a study on the effectiveness of discriminative methods for abbreviation lemmatization in Russian texts. Unlike generative approaches, discriminative models select the optimal lemma from a fixed set of candidates, eliminating the risk of generating grammatically incorrect word forms. For the first time in Russian language processing, we conduct a comprehensive analysis of ...

Added: March 10, 2026

30th International Conference on Applications of Natural Language to Information Systems, NLDB 2025, Kanazawa, Japan, July 4–6, 2025, Proceedings, Part I. Natural Language Processing and Information Systems. (LNCS, volume 15836)

Springer, 2025.

The two-volume set LNCS 15836 and 15837 constitutes the proceedings of the 30th International Conference on Applications of Natural Language to Information Systems, NLDB 2025, held in Kanazawa, Japan, during July 4–6, 2025. The 33 full papers, 19 short papers and 2 demo papers presented in this volume were carefully reviewed and selected from 120 submissions. ...

Added: February 3, 2026

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Association for Computational Linguistics, 2025.

The book contains this year’s edition of the Conference on Empirical Methods in Natural Language Processing! Importantly, it marks the 30th edition of EMNLP. With over 8,000 submissions, more than 3,000 accepted papers, and thousands of attendees, we have come a long way from that first workshop, which had 14 accepted papers. As the field looks ahead, Suzhou ...

Added: November 16, 2025

Automatic summarization of parent chats in WhatsApp

Dmitrieva K., Жолус М. Р., Вестник Новосибирского государственного университета. Серия: Лингвистика и межкультурная коммуникация 2025 Vol. 23 No. 1 P. 80–92

Automatic text summarization is one of the main tasks of natural language processing (NLP), which consists in creating a shorter version of the source text. In today’s world the amount of information consumed by people is constantly increasing, therefore more and more emphasis is being placed on the task of summarization. There are two main approaches ...

Added: July 8, 2025

Analysis of Images, Social Networks and Texts, 12th International Conference, AIST 2024, Bishkek, Kyrgyzstan, October 17–19, 2024, Revised Selected Papers

Springer, 2024.

This book constitutes the refereed proceedings of the 12th International Conference on Analysis of Images, Social Networks and Texts, AIST 2024, held in Bishkek, Kyrgyzstan, during October 17–19, 2024. The 16 full papers included in this book were carefully reviewed and selected from 70 submissions. They were organized in topical sections as follows: Natural Language Processing; Computer Vision; Data Analysis and Machine Learning; ...

Added: May 29, 2025

Knowledge Discovery, Knowledge Engineering and Knowledge Management: 15th International Joint Conference, IC3K 2023, Rome, Italy, November 13-15, 2023, Revised Selected Papers

Rome: Springer, 2025.

This book constitutes the refereed proceedings of the 15th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, IC3K 2023, held in Rome, Italy, during November 13-15, 2023. The 9 full papers and 8 short papers included in this book were carefully reviewed and selected from 166 submissions. They were organized in topical sections ...

Added: May 2, 2025

An experimental rule-based parser for Russian employing the NLP resources of the ETAP system

Inshakova E. S., Sizov V. G., in: Computational Linguistics and Intellectual Technologies: Proceedings of the International Conference “Dialogue 2020”. Issue 19 (26). 2020.

Added: April 10, 2025

Automation of Forensic Authorship Attribution: Problems and Prospects

Romanova T. V., Khomenko A., Legal Issues in the Digital Age 2022 Vol. 3 No. 2 P. 90–115

The article deals with the validation of an integrative attribution algorithm based on the analysis of the author’s idiostyle using methods of interpretative linguistics with objectification of the available data with the help of mathematical statistics. The algorithm addresses the identification problem of the attribution. The choice of parameters describing the individual style of ...

Added: March 12, 2025

Proceedings of the 28th Conference on Computational Natural Language Learning

Association for Computational Linguistics, 2024.

CoNLL is a conference organized yearly by SIGNLL (ACL’s Special Interest Group on Natural Language Learning), focusing on theoretically, cognitively and scientifically motivated approaches to computational linguistics. This year, CoNLL was held alongside EMNLP 2024. ...

Added: March 11, 2025

Big Data Analytics Approach with Multiple Text Types: The Case of the Computer Gaming

Aleksandr Belov, Zakharov F., Litvinenko E. et al., in: International IoT, Electronics and Mechatronics Conference, Volume 2. Proceedings of IEMTRONICS 2024. LNEE, volume 1228. Springer Publishing Company, 2025. P. 275–287.

Added: January 26, 2025

Automatic Morpheme Segmentation for Russian: Can an Algorithm Replace Experts?

Morozov D., Garipov T., Lyashevskaya O. et al., Journal of Language and Education 2024 Vol. 10 No. 4 P. 71–84

Introduction: Numerous algorithms have been proposed for the task of automatic morpheme segmentation of Russian words. Due to the differences in task formulation and datasets utilized, comparing the quality of these algorithms is challenging. It is unclear whether the errors in the models are due to the ineffectiveness of algorithms themselves or to errors and inconsistencies ...

Added: January 7, 2025

Threatening Expression and Target Identification in Under-Resource Languages Using NLP Techniques

Malik M. S., Lecture Notes in Computer Science 2024 Vol. 14486 P. 3–17

In recent decades, hate speech on social media platforms has been on the rise. It is highly desirable to control this kind of material because it incites unrest and harms society. The literature describes several forms of hate speech, and it is quite challenging to differentiate between these forms and to design an automated detection system, especially ...

Added: December 12, 2024

Document Classification via Stable Graph Patterns and Conceptual AMR Graphs

Parakal E. G., Dudyrev E., Sergei O. Kuznetsov et al., Lecture Notes in Computer Science 2024 Vol. 14914 P. 286–301

This paper proposes an approach and an associated system based on pattern structures, aimed at the classification of documents represented as graphs. The representation of documents relies on Abstract Meaning Representation (AMR) document graphs. Given a set of AMR document graphs, the system learns characteristic graph patterns that can be reused by an aggregate rule classifier to predict the class ...

Added: September 10, 2024

Think about what you’ve learned: sentiment analysis for modeling user experience in online education

Kirina M., Человек: образ и сущность. Гуманитарные аспекты 2024 No. 2(58) P. 176–204

The article focuses on the application of opinion mining techniques to evaluate user experience on the Hyperskill educational platform, using Python, Java, and Kotlin programming projects as the basis of analysis. The study utilizes sentiment analysis and keyword extraction methods to gauge users' attitudes towards the platform, learning process, and topics covered. To achieve this, ...

Added: December 9, 2023

Disambiguation in context in the Russian National Corpus: 20 years later

Lyashevskaya O., Afanasev I., Stefan Rebrikov et al., in: Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference “Dialogue”. Issue 22. [s.n.], 2023. P. 307–318.

An updated annotation of the Main, Media, and some other corpora of the Russian National Corpus (RNC) features the part-of-speech and other morphological information, lemmas, dependency structures, and constituency types. Transformer-based architectures are used to resolve the homonymy in context according to a schema based on the manually disambiguated subcorpus of the Main corpus (morphology ...

Added: September 15, 2023

Identifying and Visualizing Trends in Science, Technology, and Innovation Using SciBERT

Lobanova P., Bakhtin P., Sergienko Y., IEEE Transactions on Engineering Management 2024 No. 71 P. 11898–11906

Identification of science, technology, and innovation trends is a critical topic both for the scientific community and for companies that develop technologies, work on science and technology policy, or invest in high tech. In this research, the authors demonstrate a novel approach implemented in the iFORA system (developed by National Research University Higher School of Economics) using ...

Added: September 8, 2023

The Use of Khislavichi Lect Morphological Tagging to Determine its Position in the East Slavic Group

Afanasev I., in: Proceedings of Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023). Association for Computational Linguistics, 2023. P. 174–186.

The study of low-resourced East Slavic lects is becoming increasingly relevant as they face the prospect of extinction under the pressure of standard Russian while being treated by academia as an inferior part of this lect. The Khislavichi lect, spoken in a settlement on the border of Russia and Belarus, is a perfect example of ...

Added: May 15, 2023

Proceedings of Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)

Association for Computational Linguistics, 2023.

These proceedings include the 23 papers presented at the 10th Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), co-located with the 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL). Both EACL and VarDial were held in Dubrovnik, Croatia, in a hybrid format, allowing participants to attend on-site or ...

Added: May 15, 2023

Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022)

Marseille: European Language Resources Association (ELRA), 2022.

The proceedings are organised on the basis of the 22 Tracks of the Conference on Language Resources and Evaluation (LREC) held in Marseille, France, from 20 to 25 June 2022. Major topics include corpora and annotation (including tools, systems, treebanks), information extraction and information retrieval (including NER, QA, text mining, document classification, text categorisation), applications involving LRs and evaluation (including ...

Added: February 22, 2023

Automatic assessment of learners' impressions using sentiment analysis methods (based on reviews of online courses in Russian and English)

Kirina M., Тельнина Л. Д., in: Digital Humanities and Technology in Education (DHTE 2022): collected papers of the III All-Russian Scientific and Practical Conference with International Participation. November 17–18, 2022. ФГБОУ ВО МГППУ, 2022. P. 355–374.

The article describes an experiment comparing the effectiveness of sentiment analysis tools for assessing user experience, based on public reviews of online courses on the Stepik educational platform. It examines the results of automatically extracting users' sentiment ratings of the corresponding courses in both Russian and English. For Russian-language texts, the article discusses the use of the «КартаСловСент» dictionary of emotive vocabulary and of a pre-trained ...

Added: December 9, 2022

A hybrid lemmatiser for Old Church Slavonic

Afanasev I. / NRU HSE. Series WP BRP "Linguistics". 2021.

The article considers a lemmatiser that is developed specifically for Old Church Slavonic (OCS). The introduction underlines the problem of the lack of lemmatisers that might deal with different datasets of the OCS. The review gives a short description of previous attempts and current trends in lemmatisation. The lemmatiser is hybrid-based and uses the advantages ...

Added: December 28, 2021