Extraction of single-word terms from text collections based on machine learning methods
The paper describes experiments on automatic single-word term extraction that combine various word features, mainly linguistic and statistical, by means of machine learning methods. Since single-word terms are much more difficult to recognize than multi-word terms, a broad range of word features was taken into account: widely known measures (such as TF-IDF), some novel features, and proposed modifications of features usually applied to multi-word term extraction. A large target collection of Russian texts in the banking domain was used for the experiments. Term extraction results were evaluated with Average Precision, and a manually created thesaurus of banking terminology was used to validate the extracted terms. The experiments showed that using multiple features significantly improves the automatic extraction of domain-specific terms. Logistic regression performed best among the machine learning methods tested for single-word term extraction; the subset of word features significant for term extraction was also identified.
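As an illustration of the evaluation measure named above, the following is a minimal sketch of Average Precision for a ranked list of candidate terms checked against a reference set of terms. The function name, the toy word lists, and the choice to normalize by the number of reference terms found in the list are assumptions made for this example; the abstract does not specify the exact variant used.

```python
def average_precision(ranked_terms, reference_terms):
    """Average Precision of a ranked candidate list against a reference set.

    Assumption: the sum of precision values at each hit is divided by the
    number of reference terms actually found in the ranked list.
    """
    hits = 0
    precision_sum = 0.0
    for rank, term in enumerate(ranked_terms, start=1):
        if term in reference_terms:
            hits += 1
            # Precision at this rank: confirmed terms so far / candidates seen.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


# Hypothetical toy data: candidates ordered by a term-likeness score,
# and a reference set standing in for the thesaurus.
candidates = ["loan", "bank", "table", "deposit"]
thesaurus = {"loan", "bank", "deposit"}
ap = average_precision(candidates, thesaurus)
```

Here hits occur at ranks 1, 2 and 4, so the score is (1/1 + 2/2 + 3/4) / 3; a ranking that places confirmed terms higher yields a value closer to 1.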
Models for effective term extraction can depend on the type of terminological resource under construction. In this paper we study term extraction models for real, working information-retrieval thesauri. The first thesaurus is the English version of the EuroVoc thesaurus; the second is the Russian Banking thesaurus. We study single-word and two-word term extraction separately to reveal the best features and feature combinations, and we compare the best models for the two thesauri. In particular, we found that for this type of terminological resource, the use of association measures does not improve the quality of two-word term extraction based on combining multiple features.
The present article continues the investigation of the Soqotri verbal system undertaken by the Russian-Soqotri fieldwork team. The article focuses on the so-called “weak” and “geminated” roots in the basic stem. The investigation is based on the analysis of full paradigms (perfect, imperfect and jussive) of more than 170 “weak” and “geminated” Soqotri verbs.