Combining multiple features for single-word term extraction
The paper describes experiments on automatic single-word term extraction
based on combining various features of words, mainly linguistic and statistical,
by machine learning methods. Since single-word terms are much more
difficult to recognize than multi-word terms, a broad range of word features
was taken into account, including widely known measures (such
as TF-IDF), some novel features, and proposed modifications of features
usually applied to multi-word term extraction.
A large target collection of Russian texts in the banking domain was used
for the experiments. The results of term extraction were evaluated by Average
Precision, with a manually created thesaurus of banking terminology
serving as the reference against which extracted terms were validated.
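
As an illustration of the evaluation measure, the following is a minimal sketch of Average Precision over a ranked list of candidate words. It assumes the common ranked-retrieval definition, normalized by the number of reference terms found in the list; the exact variant used in the experiments is not stated here. The function name, the toy word list, and the toy reference set are hypothetical and stand in for the extracted candidates and the thesaurus.

```python
def average_precision(ranked_words, reference_terms):
    """Average Precision of a ranked candidate list against a reference set.

    Assumption: precision is averaged over the ranks at which a reference
    term occurs, and normalized by the number of reference terms found.
    """
    hits = 0
    precision_sum = 0.0
    for rank, word in enumerate(ranked_words, start=1):
        if word in reference_terms:
            hits += 1
            # Precision at this rank: share of true terms among the top `rank` words.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


# Toy data (hypothetical): a ranked candidate list and a tiny reference "thesaurus".
ranked = ["loan", "bank", "year", "deposit"]
thesaurus = {"loan", "bank", "deposit"}
ap = average_precision(ranked, thesaurus)
```

A ranking that places thesaurus terms nearer the top yields a higher value, which is why the measure suits comparing alternative feature combinations on the same candidate list.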
The experiments showed that the use of multiple features significantly improves
the results of automatic extraction of domain-specific terms. Logistic
regression proved to be the best-performing machine learning method for
single-word term extraction in these experiments; the subset of word features
significant for term extraction was also identified.
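
The idea of combining several word features with logistic regression can be sketched as follows. This is a toy illustration, not the experimental setup: the two features (a TF-IDF-like score and a hypothetical domain-specificity score), the feature values, the labels, and the training parameters are all invented for the example, and the model is fitted by plain batch gradient descent.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(X, y, lr=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # Prediction error for one word: predicted probability minus label.
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def term_score(w, b, x):
    """Estimated probability that a word with feature vector x is a term."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


# Toy training data (hypothetical): [tfidf_like, domain_specificity] per word,
# with label 1 if the word is in the reference thesaurus, else 0.
X = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.6], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2]]
y = [1, 1, 1, 0, 0, 0]
w, b = train_logistic_regression(X, y)

# Candidates are then ranked by the combined score instead of any single feature.
high = term_score(w, b, [0.85, 0.85])
low = term_score(w, b, [0.15, 0.15])
```

Sorting candidate words by `term_score` gives the ranked list that a measure such as Average Precision evaluates; the learned weights also give a rough indication of which features contribute to the decision.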