Clusters, orders, trees: methods and applications. In Honor of Boris Mirkin's 70th Birthday
The volume is dedicated to Boris Mirkin on the occasion of his 70th birthday. In addition to his startling PhD results in abstract automata theory, Mirkin’s groundbreaking contributions in various fields of decision making and data analysis have marked the fourth quarter of the 20th century and beyond. Mirkin has done pioneering work in group choice, clustering, data mining, and knowledge discovery aimed at finding and describing non-trivial or hidden structures (first of all, clusters, orderings, and hierarchies) in multivariate and/or network data.
This volume contains a collection of papers reflecting recent developments rooted in Mirkin's fundamental contributions to the state of the art in group choice, ordering, clustering, data mining, and knowledge discovery. Researchers, students, and software engineers will benefit from new knowledge discovery techniques and application directions.
We develop a consensus clustering framework proposed three decades ago in Russia and experimentally demonstrate that our least squares consensus clustering algorithm consistently outperforms several recent consensus clustering methods.
Procedures aggregating individual preferences into a collective choice differ in their vulnerability to manipulation. To measure it, one may consider the share of all preference profiles in which manipulation is possible, which is called Nitzan-Kelly's index of manipulability. The problem of manipulability can be considered in different probability models. There are three models based on anonymity and neutrality: the impartial culture model (IC), the impartial anonymous culture model (IAC), and the impartial anonymous and neutral culture model (IANC). In contrast to the first two models, the IANC model, which is based on anonymity and neutrality axioms, has not been widely studied. In addition, there have been no attempts to derive analytically the difference of probabilities (such as Nitzan-Kelly's index) between IC and IANC. We solve this problem and show in which cases the upper bound of this difference is high enough, and in which cases it is almost zero. These results enable us to simplify the computation of indices.
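As an illustration of the index under the IC model, where every preference profile is equally likely, the share of manipulable profiles can be counted by brute force for a very small electorate. The following is a minimal sketch, not the analytical derivation of the paper: the plurality rule, the alphabetical tie-breaking, and all function names are assumptions chosen for the example.

```python
from itertools import permutations, product

def plurality(profile, alternatives):
    """Plurality winner; ties are broken alphabetically (an assumption)."""
    scores = {a: 0 for a in alternatives}
    for ranking in profile:
        scores[ranking[0]] += 1
    best = max(scores.values())
    return min(a for a in alternatives if scores[a] == best)

def is_manipulable(profile, rule, alternatives, rankings):
    """True if at least one voter gains by reporting an insincere ranking."""
    sincere = rule(profile, alternatives)
    for i, truth in enumerate(profile):
        for lie in rankings:
            if lie == truth:
                continue
            changed = profile[:i] + (lie,) + profile[i + 1:]
            outcome = rule(changed, alternatives)
            # A smaller position in the true ranking means more preferred.
            if truth.index(outcome) < truth.index(sincere):
                return True
    return False

def nitzan_kelly_index(rule, alternatives, n_voters):
    """Share of manipulable profiles among all profiles (IC model)."""
    rankings = list(permutations(alternatives))
    profiles = list(product(rankings, repeat=n_voters))
    manipulable = sum(
        is_manipulable(p, rule, alternatives, rankings) for p in profiles
    )
    return manipulable / len(profiles)

if __name__ == "__main__":
    print(nitzan_kelly_index(plurality, "abc", 3))
```

The enumeration grows as (m!)^n for m alternatives and n voters, so it is only practical for toy cases; this is one reason analytical results on the indices are useful.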
A suffix-tree based method for measuring the similarity of a key phrase to an unstructured text is proposed. The measure involves less computation, and it does not depend on the length of the text or the key phrase. It applies to the following tasks in semantic text analysis:
Finding interrelations between key phrases over a set of texts;
Annotating a research article by topics from a taxonomy of the domain;
Clustering relevant topics and mapping clusters onto a domain taxonomy.
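To give a feel for a length-normalized, suffix-based relevance score, here is a simplified sketch. It is not the authors' algorithm: instead of a true suffix tree it uses a plain substring counter, and the scoring formula (the average conditional frequency of matched characters, averaged over all suffixes of the key phrase) as well as the names `substring_counts`, `phrase_score`, and `max_len` are assumptions made for illustration.

```python
from collections import Counter

def substring_counts(text, max_len=20):
    """Count every substring of up to max_len characters.

    Each counted substring stands in for a suffix-tree node
    annotated with its frequency.
    """
    counts = Counter()
    for start in range(len(text)):
        stop = min(len(text), start + max_len)
        for end in range(start + 1, stop + 1):
            counts[text[start:end]] += 1
    return counts

def phrase_score(phrase, text, max_len=20):
    """Score in [0, 1]: higher means the phrase fits the text better."""
    if not phrase or not text:
        return 0.0
    counts = substring_counts(text, max_len)
    total = 0.0
    for start in range(len(phrase)):
        suffix = phrase[start:start + max_len]
        parent_freq = len(text)  # root frequency: all character positions
        matched, prob_sum = 0, 0.0
        for end in range(1, len(suffix) + 1):
            freq = counts.get(suffix[:end], 0)
            if freq == 0:
                break  # the match along this suffix ends here
            prob_sum += freq / parent_freq  # conditional frequency of the node
            parent_freq = freq
            matched += 1
        if matched:
            total += prob_sum / matched  # normalize by match length
    return total / len(phrase)  # normalize by number of suffixes

if __name__ == "__main__":
    text = "clustering of data and clustering of texts"
    print(phrase_score("clustering", text), phrase_score("ordering", text))
```

Because each suffix score is divided by its match length and the total is divided by the number of suffixes, the result stays between 0 and 1 whatever the lengths of the phrase and the text. A partial match such as "ordering" still earns some credit through the shared fragment "ering".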
Recently, a three-stage version of K-Means has been introduced in which not only clusters and their centers but also feature weights are adjusted to minimize the sum of the p-th powers of the Minkowski p-distances between entities and the centroids of their clusters. The value of the Minkowski exponent p appears to be instrumental in the ability of the method to recover clusters hidden in data. This paper addresses the problem of finding the best p for a Minkowski metric-based version of K-Means in each of two settings: semi-supervised and unsupervised. It presents experimental evidence that solutions found with the proposed approaches are sufficiently close to the optimum.
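The criterion described above can be written out in a few lines. This is a minimal sketch under stated assumptions: the weights are taken to be cluster-specific and to enter the distance as (w * |y - c|)^p, and the function names are hypothetical. It shows only the distance, the assignment step, and the criterion value, not the center or weight updates.

```python
def weighted_minkowski_p(x, center, weights, p):
    """p-th power of the weighted Minkowski p-distance between x and center."""
    return sum((w * abs(a - c)) ** p for a, c, w in zip(x, center, weights))

def assign(data, centers, weights, p):
    """Assign each entity to the center that is nearest under the distance."""
    labels = []
    for x in data:
        nearest = min(
            range(len(centers)),
            key=lambda k: weighted_minkowski_p(x, centers[k], weights[k], p),
        )
        labels.append(nearest)
    return labels

def criterion(data, labels, centers, weights, p):
    """Sum of p-th powers of distances between entities and their centroids."""
    return sum(
        weighted_minkowski_p(x, centers[k], weights[k], p)
        for x, k in zip(data, labels)
    )

if __name__ == "__main__":
    data = [[0, 0], [0, 2], [10, 0], [10, 2]]
    centers = [[0, 1], [10, 1]]
    weights = [[0.5, 0.5], [0.5, 0.5]]  # one weight vector per cluster (assumed)
    labels = assign(data, centers, weights, p=2)
    print(labels, criterion(data, labels, centers, weights, p=2))
```

With unit weights and p = 2 the criterion reduces to the ordinary K-Means sum of squared Euclidean distances, which is a convenient sanity check when experimenting with other values of p.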
Over the past ten years, algorithms and technologies for the analysis of network structures have been applied to financial markets, among other approaches. The first step of such an analysis is to describe the financial market under consideration via the correlation matrix of stock prices over a certain period of time. The second step is to build a graph in which vertices represent stocks and edge weights represent correlation coefficients between the corresponding stocks. In this paper we suggest a new method of analyzing stock markets based on dividing a market into several substructures (called stars) in which all stocks are strongly correlated with a leading (central, median) stock. The method is based on the p-median model, a feasible solution to which is represented by a collection of stars. Our method is able to find an exact solution for relatively small markets (fewer than 1000 stocks) and a high-quality solution for large markets (many thousands of stocks). We observed an important "median nesting" property of the returned solutions: the p leading stocks, or medians, of the stars are repeated in the solution for p+1 stars.
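The star decomposition can be illustrated on a toy correlation matrix. The sketch below is not the method of the paper, whose solver is not described here: it enumerates all sets of p medians exhaustively, which is feasible only for a handful of stocks, and it assumes the dissimilarity 1 - correlation. The function name `p_median_stars` and the sample matrix are hypothetical.

```python
from itertools import combinations

def p_median_stars(corr, p):
    """Exhaustive p-median on a correlation matrix (toy sizes only).

    Returns the medians, the stars as {median: [member stocks]},
    and the total dissimilarity of stocks to their medians.
    """
    n = len(corr)
    # Assumed dissimilarity: strongly correlated stocks are close.
    dist = [[1.0 - corr[i][j] for j in range(n)] for i in range(n)]
    best_cost, best_medians = float("inf"), None
    for medians in combinations(range(n), p):
        cost = sum(min(dist[i][m] for m in medians) for i in range(n))
        if cost < best_cost:
            best_cost, best_medians = cost, medians
    stars = {m: [] for m in best_medians}
    for i in range(n):
        nearest = min(best_medians, key=lambda m: dist[i][m])
        stars[nearest].append(i)
    return best_medians, stars, best_cost

if __name__ == "__main__":
    # Two groups of three stocks; stocks 1 and 4 are the natural leaders.
    pairs = {(0, 1): 0.9, (1, 2): 0.9, (0, 2): 0.6,
             (3, 4): 0.9, (4, 5): 0.9, (3, 5): 0.6}
    corr = [[1.0 if i == j else pairs.get((min(i, j), max(i, j)), 0.1)
             for j in range(6)] for i in range(6)]
    print(p_median_stars(corr, 2))
```

Running the function for p and then for p+1 on the same matrix gives a direct way to check whether the medians are nested for a particular data set.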