Evaluation of collocation extraction methods for the Russian language
This paper focuses on empirical collocations, understood here as word co-occurrences that 1) are frequent enough to be extracted automatically and 2) may be semantically and/or syntactically bound to varying degrees. Our main goal is to examine closely five window-based methods for empirical collocation extraction that are widely used in corpus-based studies, sometimes without their effectiveness having been demonstrated. Our study evaluates the methods’ reliability for Russian data by testing two hypotheses: a) collocations listed in a professionally compiled dictionary (i.e., those considered fixed to some extent by experts in the field) should rank higher in automatically extracted lists of collocations, and b) collocations considered fixed expressions by native speakers should rank higher in automatically generated lists. Our results indicate that raw frequency, t-score, log-likelihood, and Dice give the best rankings, while MI and wFR perform worse in both evaluations. In general, the two evaluations, although each has its own limitations, lead to comparable results, which should be taken into account in future research.
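For orientation, the sketch below shows how four of the measures named above (t-score, MI, Dice, and log-likelihood) are commonly computed from co-occurrence counts. It uses the standard textbook formulations, which may differ in detail from the variants evaluated in this paper. wFR is omitted because its definition is not given here. The function name `association_scores` and its parameters are illustrative assumptions, and the counts are assumed to come from a window-based co-occurrence count carried out beforehand.

```python
import math


def association_scores(f_xy, f_x, f_y, n):
    """Score one word pair with common textbook association measures.

    Hypothetical helper for illustration only.
    f_xy: co-occurrence count of the pair within the window (must be > 0)
    f_x, f_y: corpus frequencies of the two words
    n: corpus size in tokens
    """
    # Co-occurrence count expected if the two words were independent.
    expected = f_x * f_y / n

    t_score = (f_xy - expected) / math.sqrt(f_xy)
    mi = math.log2(f_xy / expected)
    dice = 2 * f_xy / (f_x + f_y)

    # Log-likelihood ratio over the 2x2 contingency table.
    observed = [f_xy, f_x - f_xy, f_y - f_xy, n - f_x - f_y + f_xy]
    rows = [f_x, n - f_x]
    cols = [f_y, n - f_y]
    expected_cells = [r * c / n for r in rows for c in cols]
    # Cells with a zero observed count contribute nothing to the sum.
    log_likelihood = 2 * sum(
        o * math.log(o / e)
        for o, e in zip(observed, expected_cells)
        if o > 0
    )

    return {
        "frequency": f_xy,
        "t_score": t_score,
        "mi": mi,
        "dice": dice,
        "log_likelihood": log_likelihood,
    }
```

Ranking candidate pairs by any one of these values yields the kind of automatically extracted list that the two hypotheses compare against dictionary entries and native-speaker judgments.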