Intelligent Choice of the Number of Clusters in K-Means Clustering: An Experimental Study with Different Cluster Spreads
The issue of determining “the right number of clusters” in K-Means has attracted considerable interest, especially in recent years. Cluster intermix appears to be the factor that most affects the clustering results. This paper proposes an experimental setting for comparing different approaches on data generated from Gaussian clusters, with controlled parameters of between- and within-cluster spread to model cluster intermix. The setting allows centroid recovery to be evaluated on a par with the conventional evaluation of cluster recovery. The subjects of our interest are two versions of the “intelligent” K-Means method, iK-Means, which find the “right” number of clusters by extracting “anomalous patterns” from the data one by one. We compare them with seven other methods, including Hartigan’s rule, averaged Silhouette width and the Gap statistic, under different between- and within-cluster spread-shape conditions. There are several consistent patterns in the results of our experiments; for example, Hartigan’s rule reproduces the right K best, but not the clusters or their centroids. This leads us to propose an adjusted version of iK-Means, which performs well in the current experimental setting.
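As an illustration of one of the compared methods, Hartigan’s rule can be sketched in a few lines. The sketch assumes the commonly cited form of the rule: compute H(K) = (W_K / W_{K+1} - 1)(N - K - 1), where W_K is the K-Means within-cluster sum of squares for K clusters and N is the number of objects, and take the first K at which H(K) does not exceed 10. The function names and the toy W_K values below are hypothetical and are not taken from the paper’s experiments.

```python
def hartigan_index(w_k, w_next, n, k):
    """Hartigan's index H(K) = (W_K / W_{K+1} - 1) * (N - K - 1)."""
    return (w_k / w_next - 1.0) * (n - k - 1)


def choose_k_hartigan(wss, n, threshold=10.0):
    """Return the first K whose Hartigan index does not exceed the threshold.

    wss maps K -> within-cluster sum of squares W_K for consecutive
    K = 1, 2, ...; returns None if no K in the supplied range qualifies.
    """
    k = 1
    while k in wss and (k + 1) in wss:
        if hartigan_index(wss[k], wss[k + 1], n, k) <= threshold:
            return k
        k += 1
    return None


# Hypothetical W_K values for N = 100 objects (illustration only):
# the drop in W_K flattens out after K = 3.
toy_wss = {1: 1000.0, 2: 400.0, 3: 100.0, 4: 95.0, 5: 92.0}
chosen_k = choose_k_hartigan(toy_wss, n=100)
print(chosen_k)
```

In practice the W_K values would come from running K-Means for each K, typically keeping the best of several random initialisations.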
Clustering large and heterogeneous data of user profiles from social media is problematic, as the problem of finding the optimal number of clusters becomes more critical than for clustering smaller and homogeneous data. We propose a new approach based on the deformed Rényi entropy for determining the optimal number of clusters in hierarchical clustering of user-profile data. Our results show that this approach allows us to estimate Rényi entropy for each level of a hierarchical model and find the entropy minimum (information maximum). Our approach also shows that solutions with the lowest and the highest number of clusters correspond to the entropy maxima (minima of information).
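For orientation, the sketch below computes the standard (textbook) Rényi entropy of order q, H_q = log(sum_i p_i^q) / (1 - q), for the distribution of cluster sizes at one level of a hierarchy. This is an assumption made for illustration only: the abstract does not define the deformed Rényi entropy, and the standard form shown here is not expected to reproduce the behaviour described above (for example, it is zero for a single cluster). The function names and the choice of cluster-size proportions as the probability distribution are hypothetical.

```python
import math


def renyi_entropy(probs, q):
    """Standard Renyi entropy of order q (natural log) for a distribution."""
    if q < 0:
        raise ValueError("order q must be non-negative")
    if any(p < 0 for p in probs) or not math.isclose(sum(probs), 1.0, rel_tol=1e-9):
        raise ValueError("probs must be non-negative and sum to 1")
    positive = [p for p in probs if p > 0]
    if math.isclose(q, 1.0):
        # The limit q -> 1 is the Shannon entropy.
        return -sum(p * math.log(p) for p in positive)
    return math.log(sum(p ** q for p in positive)) / (1.0 - q)


def cluster_size_entropy(sizes, q):
    """Renyi entropy of the cluster-size proportions at one hierarchy level."""
    total = sum(sizes)
    return renyi_entropy([s / total for s in sizes], q)


# Hypothetical cluster sizes at two levels of a hierarchy (illustration only).
print(cluster_size_entropy([5, 5, 5, 5], q=2.0))   # four equal clusters
print(cluster_size_entropy([17, 1, 1, 1], q=2.0))  # one dominant cluster
```

Scanning such a value over all levels of a dendrogram and locating its extremum mirrors the general idea of the approach, with the deformed entropy taking the place of the standard one.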
Clustering cities based on their socio-economic development over a long time period is an important issue, and the results may be used in many ways, e.g., in strategic regional planning. In this paper we continue our recent study, in which a cumulative attribute for each year replaces nine other attributes, called the 'vector of dynamics'. In our previous paper an original ranking method was proposed. Using the same data set, here we try out classical clustering models, namely Minimum sum of squares and Harmonic means clustering. Results for these two models are obtained using Variable neighborhood search based heuristics. A comparative study of the old and new results on 120 large Russian cities is provided and analyzed.
This paper presents a further investigation into the computational properties of a novel method, Fuzzy Additive Spectral clustering (FADDIS), recently introduced by the authors. Specifically, we extend our analysis to ‘difficult’ data structures from the recent literature and develop two synthetic data generators simulating affinity data of Gaussian clusters and genuine additive similarity data, with a controlled level of noise. FADDIS is experimentally verified on these data in comparison with two state-of-the-art fuzzy clustering methods. The claimed ability of FADDIS to help in determining the right number of clusters is experimentally tested, and the role of the pseudo-inverse Laplacian data transformation in this is highlighted. A potentially useful extension of the method to biclustering is introduced.
In this paper, I discuss current developments in cluster analysis to bring forth earlier developments by E. Braverman and his team. Specifically, I begin by recalling their Spectrum clustering method and Matrix diagonalization criterion. These two include a number of user-specified parameters, such as the number of clusters and the similarity threshold, which reflects the state of affairs at the early stages of data science development; it remains so today. Meanwhile, a data-recovery view of the Principal Component Analysis method admits a natural extension to clustering which embraces two of the most popular clustering methods, K-Means partitioning and Ward agglomerative clustering. To see that, one only needs to adjust the point of view and recognise an equivalent complementary criterion demanding that the cluster be simultaneously “large-sized” and “anomalous”. Moreover, this paradigm shows that the complementary criterion can be reformulated in terms of object-to-object similarities. This criterion appears to be equivalent to the heuristic Matrix diagonalization criterion by Dorofeyuk-Braverman. Furthermore, a greedy one-by-one cluster extraction algorithm for this criterion appears to be a version of Braverman’s Spectrum algorithm, but with automated adjustment of parameters. An illustrative example with mixed scale data completes the presentation.
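One way to read the “large-sized and anomalous” criterion is through the standard decomposition of the data scatter: after centering the data at the grand mean, the sum of squared norms of all objects equals the sum over clusters of N_k * ||c_k||^2 plus the K-Means within-cluster sum of squares W. Under that reading (an assumption here, since the abstract does not spell the criterion out), minimising W is the same as maximising the sum of N_k * ||c_k||^2, where a term is large when the cluster has many members (N_k) and its centroid c_k lies far from the reference point. The function names and toy data below are hypothetical.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[j] for p in points) / n for j in range(len(points[0])))


def sq_norm(v):
    """Squared Euclidean norm."""
    return sum(x * x for x in v)


def scatter_decomposition(points, labels):
    """Return (total scatter, per-cluster N_k*||c_k||^2, within-cluster W).

    Data are first centered at the grand mean, which then serves as the
    reference point against which a cluster is judged "anomalous".
    """
    grand_mean = centroid(points)
    centered = [tuple(x - m for x, m in zip(p, grand_mean)) for p in points]
    clusters = {}
    for p, lab in zip(centered, labels):
        clusters.setdefault(lab, []).append(p)
    total = sum(sq_norm(p) for p in centered)
    contributions = {}
    within = 0.0
    for lab, members in clusters.items():
        c = centroid(members)
        contributions[lab] = len(members) * sq_norm(c)
        within += sum(
            sq_norm(tuple(x - cx for x, cx in zip(p, c))) for p in members
        )
    return total, contributions, within


# Hypothetical two-cluster toy data (illustration only).
toy_points = [(1.0, 1.0), (3.0, 1.0), (-2.0, -1.0), (-2.0, -3.0), (-2.0, -2.0)]
toy_labels = ["A", "A", "B", "B", "B"]
total, contributions, within = scatter_decomposition(toy_points, toy_labels)
print(total, contributions, within)
```

The identity holds for any partition, so comparing the explained part across candidate partitions is equivalent to comparing their K-Means criterion values.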
Today, many people suffer from chronic diseases because of their lifestyle, food habits and reduced physical activity. Diabetes is one of the most common chronic diseases and affects people of all ages. Diabetes complications arise in the human body when the blood glucose (sugar) level rises above the normal level. Type-2 diabetes is considered one of the most prevalent endocrine disorders. In this context, we apply machine learning algorithms to build a statistical prediction model so that people with diabetes can be aware of the prevalence of their condition. The aim of this paper is to detect the prevalence of diabetes-related complications among patients with Type-2 diabetes mellitus. For data processing and statistical analysis we used the Python libraries Scikit-Learn and Pandas. We also used machine learning approaches, namely an Artificial Neural Network (ANN) and unsupervised K-means clustering, to develop a classification-based prediction model for assessing Type-2 diabetes mellitus chronic disease.
The paper examines the structure, governance, and balance sheets of state-controlled banks in Russia, which accounted for over 55 percent of the total assets in the country's banking system in early 2012. The author offers a credible estimate of the size of the country's state banking sector by including banks that are indirectly owned by public organizations. Contrary to some predictions based on the theoretical literature on economic transition, he explains the relatively high profitability and efficiency of Russian state-controlled banks by pointing to their competitive position in such functions as acquisition and disposal of assets on behalf of the government. Also suggested in the paper is a different way of looking at market concentration in Russia (by consolidating the market shares of core state-controlled banks), which produces a picture of a more concentrated market than officially reported. Lastly, one of the author's interesting conclusions is that China provides a better benchmark than the formerly centrally planned economies of Central and Eastern Europe by which to assess the viability of state ownership of banks in Russia and to evaluate the country's banking sector.
The paper examines the principles for the supervision of financial conglomerates proposed by the BCBS in the consultative document published in December 2011. In addition, the article offers a number of suggestions worked out by the authors within the HSE research team.