NAACL HLT 2015 11th Workshop on Multiword Expressions MWE 2014
The paper describes the results of an empirical study of integrating bigram collocations and similarities between them and unigrams into topic models. First of all, we propose a novel algorithm PLSA-SIM that is a modification of the original algorithm PLSA. It incorporates bigrams and maintains relationships between unigrams and bigrams based on their component structure. Then we analyze a variety of word association measures in order to integrate top-ranked bigrams into topic models. All experiments were conducted on four text collections of different domains and languages. The experiments distinguish a subgroup of tested measures that produce topranked bigrams, which demonstrate signifi- cant improvement of topic models quality for all collections, when integrated into PLSASIM algorithm.