An approach to the problem of annotation of research publications
We explore an approach to multi-label annotation of research papers. We develop techniques for annotating research papers in informatics and computer science with key phrases taken from the ACM Computing Classification System. The techniques rely on a phrase-to-text relevance measure, so that only the most relevant phrases enter the annotation. Three phrase-to-text relevance measures are compared experimentally in this setting: (a) the cosine relevance score between conventional vector space representations of the texts, coded with tf-idf weighting; (b) BM25, a popular characteristic of the probability of term generation; and (c) CPAMF, an in-house characteristic of the conditional probability of symbols averaged over matching fragments in suffix trees representing texts and phrases. In an experiment conducted over a set of texts published in ACM journals and manually annotated by their authors, CPAMF outperforms both the cosine measure and BM25 by a wide margin.
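To make the annotation scheme concrete, the following is a minimal sketch of measure (a) only: it scores each candidate phrase against a text by tf-idf cosine similarity and keeps the top-scoring phrases. It is an illustration under stated assumptions, not the implementation used in the experiments. The function names (`tokenize`, `tfidf_vectors`, `cosine`, `annotate`), the whitespace tokenizer, the smoothed idf formula, the choice to compute idf over the text together with the candidate phrases, and the `top_k` cutoff are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, no stemming or stop words.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict) per tokenized document."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Assumption: smoothed idf, so that no weight is zero or undefined.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in df}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def annotate(text, phrases, top_k=2):
    """Return up to top_k phrases with the highest nonzero relevance to text."""
    # Assumption: idf is estimated from the text plus the candidate phrases.
    docs = [tokenize(text)] + [tokenize(p) for p in phrases]
    vectors = tfidf_vectors(docs)
    scored = [(cosine(vectors[0], v), p) for v, p in zip(vectors[1:], phrases)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for score, p in scored[:top_k] if score > 0.0]


if __name__ == "__main__":
    text = "we study suffix trees for string matching in text retrieval"
    phrases = ["information retrieval", "suffix trees", "computer graphics"]
    print(annotate(text, phrases))
```

Replacing `cosine` with a BM25 or CPAMF scoring function would leave `annotate` unchanged, which is the sense in which the three measures are interchangeable in this setting.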