• A
  • A
  • A
  • ABC
  • ABC
  • ABC
  • А
  • А
  • А
  • А
  • А
Regular version of the site

Article

Parsimonious Generalization of Fuzzy Thematic Sets in Taxonomies Applied to the Analysis of Tendencies of Research in Data Science

Information Sciences. 2020. Vol. 512. P. 595-615.
Frolov D., Nascimento S., Fenner T., Mirkin B.

This paper proposes a novel method, referred to as ParGenFS, for finding a most specific generalization of a query set represented by a fuzzy set of topics assigned to leaves of the rooted tree of a taxonomy. The query set is generalized by “lifting” it to one or more “head subjects” in the higher ranks of the taxonomy. The head subjects should cover the query set, with the possible addition of some “gaps”, taxonomy nodes covered by the head subject but irrelevant to the query set. To decrease the numbers of gaps, we admit some “offshoots”, nodes belonging to the query set but not covered by the head subject. The method globally minimizes the total number of head subjects, gaps and offshoots, each suitably weighted. Our algorithm is applied to the structural analysis and description of a collection of 17685 abstracts of research papers published in 17 Springer journals related to Data Science for the 20-year period 1998-2017. Our taxonomy of Data Science (TDS) is extracted from the Association for Computing Machinery Computing Classification System 2012 (ACM-CCS), a six-level hierarchical taxonomy manually developed by a team of ACM experts. The TDS also includes a number of additional leaves that we added to cater for recent developments not represented in the ACM-CCS taxonomy. We find fuzzy clusters of leaf topics over the text collection, using specially developed machinery. Three of the clusters are indeed thematic, relating to the Data Science sub-areas of (a) learning, (b) information retrieval, and (c) clustering. These three clusters are then lifted in the TDS using ParGenFS, which allows us to draw some conclusions about tendencies in developments in these areas.