Finding an appropriate generalization for a fuzzy thematic set in taxonomy
HSE Publishing House, 2018. No. 4.
This paper proposes a novel method, referred to as ParGenFS, for finding a most specific generalization of a query set represented by a fuzzy set of topics assigned to leaves of the rooted tree of a taxonomy. The generalization lifts the query set to one or several head subjects in the higher ranks of the taxonomy. A head subject is supposed to cover the query set tightly, however dispersed the set may be over the branches of the tree. Such coverage may bring in some gaps, which are taxonomy nodes covered by the head subject but irrelevant to the query set. To balance that, we admit some offshoots, which are nodes belonging to the query set but not covered by the head subject. The method globally minimizes the total number of head subjects, gaps, and offshoots, each weighted differently. We apply the algorithm to the structural analysis and description of a collection of 17685 abstracts of research papers published in 17 Springer journals on data science over the 20-year period 1998–2017. Our taxonomy of Data Science (DST) is extracted from the international Association for Computing Machinery Computing Classification System 2012 (ACM-CCS), a six-layer hierarchical taxonomy manually developed by a team of ACM experts. The DST also includes a number of additions of our own that detail the leaves of the ACM-CCS taxonomy. We find fuzzy clusters of leaf topics over the text collection using specially developed machinery. Three of the clusters are indeed thematic, relating to the Data Science sub-areas of (a) learning, (b) information retrieval, and (c) clustering. These three clusters are lifted with ParGenFS in the DST, which allows us to draw some conclusions about the tendencies of development in these areas.
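The trade-off between head subjects, gaps, and offshoots can be illustrated with a small sketch. This is not the ParGenFS algorithm itself. It is a simplified illustration that assumes a crisp (non-fuzzy) query set, counts every head subject as 1, and uses hypothetical weights `lam` (per gap) and `gamma` (per offshoot). The function name `lift`, the toy tree, and the weight values are all assumptions made for the example.

```python
from typing import Dict, List, Set, Tuple

Tree = Dict[str, List[str]]  # node -> list of children; leaves are absent or map to []


def lift(tree: Tree, root: str, query: Set[str],
         lam: float = 0.5, gamma: float = 0.9
         ) -> Tuple[float, List[str], List[str], List[str]]:
    """Return (penalty, head_subjects, gaps, offshoots) for the subtree at root.

    Penalty = #heads + lam * #gaps + gamma * #offshoots (simplified, crisp sets).
    """

    def has_query(node: str) -> bool:
        # True if at least one query leaf lies at or below this node.
        kids = tree.get(node, [])
        if not kids:
            return node in query
        return any(has_query(c) for c in kids)

    def gaps_under(node: str) -> List[str]:
        # Maximal nodes below `node` that contain no query leaves.
        gaps: List[str] = []
        for c in tree.get(node, []):
            if not has_query(c):
                gaps.append(c)
            else:
                gaps.extend(gaps_under(c))
        return gaps

    def best(node: str) -> Tuple[float, List[str], List[str], List[str]]:
        kids = tree.get(node, [])
        if not kids:
            # An uncovered query leaf is an offshoot; other leaves cost nothing.
            if node in query:
                return gamma, [], [], [node]
            return 0.0, [], [], []
        if not has_query(node):
            return 0.0, [], [], []
        # Option 1: no head subject here; combine the children's best solutions.
        cost, heads, gaps, offs = 0.0, [], [], []
        for c in kids:
            c_cost, c_heads, c_gaps, c_offs = best(c)
            cost += c_cost
            heads += c_heads
            gaps += c_gaps
            offs += c_offs
        # Option 2: make this node a head subject and pay for the gaps it brings in.
        here_gaps = gaps_under(node)
        head_cost = 1.0 + lam * len(here_gaps)
        if head_cost < cost:
            return head_cost, [node], here_gaps, []
        return cost, heads, gaps, offs

    return best(root)


# Toy taxonomy: two branches with three leaves each (illustrative only).
tree: Tree = {"root": ["A", "B"],
              "A": ["a1", "a2", "a3"],
              "B": ["b1", "b2", "b3"]}
cost, heads, gaps, offshoots = lift(tree, "root", {"a1", "a2", "b1"})
print(cost, heads, gaps, offshoots)
```

With these assumed weights, the sketch lifts the two query leaves under `A` to the head subject `A` at the price of one gap (`a3`), while the lone query leaf `b1` is cheaper to leave as an offshoot than to cover with a head subject at `B` or at the root.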