Computationally refining a taxonomy by using annotated suffix trees over Wikipedia resources
A two-step approach to devising a hierarchical taxonomy of a domain is outlined. As the first step, a coarse “high-rank” taxonomy frame is built manually using the materials of the government and other representative sites. As the second step, the frame is refined topic-by topic using the Russian Wikipedia category tree and articles filtered of “noise”. A topic-to-text similarity score, based on annotated suffix trees, is used throughout. The method consists of three main stages: 1) clearing Wikipedia data of noise, such as irrelevant articles and categories; 2) refining the taxonomy frame with the remaining relevant Wikipedia categories and articles; 3) extracting key words and phrases from Wikipedia articles. Also, a set of so-called descriptors is assigned to every leaf; these are phrases explaining aspects of the leaf topic. In contrast to many existing taxonomies, our resulting taxonomy is balanced so that all the branches are of similar depths and similar numbers of leaves. The method is illustrated by its application to a mathematics domain, “Probability theory and mathematical statistics”.