Methods and Tools for Extracting Terms from Texts for Terminological Tasks
This paper reviews the current state of automatic term extraction from specialized natural-language texts, including scientific and technical documents. Practical applications of term extraction methods and tools include the creation of terminological dictionaries, thesauri, and glossaries for problem-oriented domains, as well as keyword extraction and the construction of subject indexes for highly specialized documents.
The paper surveys approaches to the automatic recognition and extraction of terminological words and phrases, covering traditional statistical methods as well as machine learning methods, including learning from term features and learning with modern transformer-based neural language models. The approaches are compared, including reported quality estimates for term recognition and term extraction, and the best-known software tools that automate term extraction within the statistical and feature-based learning approaches are listed.
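As an illustration of how the quality of term extraction is commonly measured, the sketch below computes precision, recall, and F1 for a set of extracted terms against a gold-standard set. The function name `term_prf` and the exact-match comparison are assumptions made for this example; the measures actually used in the compared studies may differ (for instance, partial matching or token-level scoring).

```python
from typing import Iterable, Tuple


def term_prf(extracted: Iterable[str], gold: Iterable[str]) -> Tuple[float, float, float]:
    """Precision, recall and F1 of extracted terms against gold terms.

    Assumption: a term counts as correct only on exact string match.
    """
    extracted_set, gold_set = set(extracted), set(gold)
    true_pos = len(extracted_set & gold_set)
    precision = true_pos / len(extracted_set) if extracted_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical toy data, not taken from the paper.
extracted_terms = ["binary tree", "hash table", "the value"]
gold_terms = ["binary tree", "hash table", "linked list", "stack"]
precision, recall, f1 = term_prf(extracted_terms, gold_terms)
```

On this toy data two of the three extracted phrases are gold terms, so precision is 2/3 and recall is 2/4.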
The authors' studies on term recognition with neural language models are described, as applied to Russian scientific texts on mathematics and programming. The dataset with terminological annotations created for training term recognition models, which covers seven related domains, is briefly characterized. The models were built on the pre-trained BERT neural network model, fine-tuned in two ways: as a binary classifier of candidate terms (extracted from the texts beforehand) and as a sequence-labeling classifier that tags terminological words in texts. The term recognition quality of the developed models was evaluated experimentally and compared with that of a statistical method. Binary classification models show the best quality, significantly surpassing the other approaches considered. The experiments also show that the trained models are applicable to texts from a related scientific field.
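The two fine-tuning setups differ mainly in how the annotated data is presented to the model. The following minimal sketch shows one plausible way to derive training examples for each setup from the same annotated sentence. It is not the authors' pipeline: the BIO tag names, the function names, and the use of all n-grams up to a fixed length as candidates are assumptions made for illustration, and the model fine-tuning step itself is omitted.

```python
from typing import List, Tuple


def bio_labels(tokens: List[str], terms: List[List[str]]) -> List[str]:
    """Sequence-labeling view: one B-TERM / I-TERM / O tag per token."""
    labels = ["O"] * len(tokens)
    # Longer terms are placed first so shorter nested terms do not overwrite them.
    for term in sorted(terms, key=len, reverse=True):
        n = len(term)
        for i in range(len(tokens) - n + 1):
            span_free = all(tag == "O" for tag in labels[i:i + n])
            if tokens[i:i + n] == term and span_free:
                labels[i] = "B-TERM"
                for j in range(i + 1, i + n):
                    labels[j] = "I-TERM"
    return labels


def candidate_examples(
    tokens: List[str], terms: List[List[str]], max_len: int = 3
) -> List[Tuple[str, int]]:
    """Binary-classification view: (candidate phrase, 1 if annotated term else 0).

    Assumption: candidates are all n-grams of up to max_len tokens; a real
    system would typically filter candidates, e.g. by part-of-speech patterns.
    """
    term_set = {tuple(term) for term in terms}
    examples = []
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            candidate = tuple(tokens[i:i + n])
            examples.append((" ".join(candidate), int(candidate in term_set)))
    return examples


# Hypothetical annotated sentence, not taken from the authors' dataset.
sentence = ["the", "binary", "search", "tree", "stores", "keys"]
annotated = [["binary", "search", "tree"], ["keys"]]
tags = bio_labels(sentence, annotated)
pairs = candidate_examples(sentence, annotated)
```

In the sequence-labeling view the model sees the whole sentence and predicts a tag per token, while in the candidate view each phrase is classified separately, which is why the second setup depends on the quality of the prior candidate extraction step.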