Composing Tree Graphical Models with Persistent Homology Features for Clustering Mixed-Type Data
Clustering data with both continuous and discrete attributes is a challenging task. Existing methods lack a principled probabilistic formulation. In this paper, we propose a clustering method based on a tree-structured graphical model that describes the generation process of mixed-type data. Our tree-structured model factorizes into a product of pairwise interactions and thus localizes the interaction between feature variables of different types. To provide a robust clustering method based on the tree model, we adopt a topographical view and compute the peaks of the density function and their attractive basins for clustering. Furthermore, we leverage the theory of topological data analysis to adaptively merge trivial peaks into large ones in order to achieve meaningful clusterings. Our method outperforms state-of-the-art methods on mixed-type data.
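The peak-and-basin idea with persistence-based merging can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes the density values are already given for each point (rather than computed from the tree-structured model), that a neighborhood graph is supplied, and that a peak counts as trivial when its prominence falls below a hypothetical threshold `tau`. The names `persistence_clusters`, `density`, `neighbors` and `tau` are illustrative only.

```python
def persistence_clusters(density, neighbors, tau):
    """Assign each point to a density peak, merging peaks of prominence < tau.

    density   -- list of density estimates, one per point (assumed given)
    neighbors -- list of neighbor index lists (assumed neighborhood graph)
    tau       -- hypothetical prominence threshold for "trivial" peaks
    """
    # Visit points from highest to lowest density.
    order = sorted(range(len(density)), key=lambda i: -density[i])
    parent = {}  # union-find forest; each root is the peak of its cluster

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in order:
        # Neighbors that were already visited have density >= density[i].
        higher = [j for j in neighbors[i] if j in parent]
        if not higher:
            parent[i] = i  # no higher neighbor: i is a local peak
            continue
        # Climb towards the densest neighbor (the attractive basin of its peak).
        best = max(higher, key=lambda j: density[j])
        parent[i] = find(best)
        # Point i may connect several basins; merge the low-prominence ones.
        for j in higher:
            ri, rj = find(i), find(j)
            if ri == rj:
                continue
            low, high = (ri, rj) if density[ri] < density[rj] else (rj, ri)
            if density[low] - density[i] < tau:
                parent[low] = high  # absorb the trivial peak
    return [find(i) for i in range(len(density))]


# Seven points on a line with peaks at indices 1, 3 and 5.
density = [1.0, 5.0, 2.0, 4.0, 1.5, 1.9, 1.0]
neighbors = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4, 6], [5]]
labels = persistence_clusters(density, neighbors, tau=1.0)
```

In this toy example the peak at index 5 has prominence 0.4 and is absorbed into the basin of index 3, while the peaks at indices 1 and 3 survive. Raising `tau` merges more peaks; setting it to zero keeps every local maximum as its own cluster.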
The paper describes the results of an experimental study of topic models applied to the task of single-word term extraction. The experiments encompass several probabilistic and non-probabilistic topic models and demonstrate that topic information improves the quality of term extraction and that NMF with KL-divergence minimization is the best among the models under study.
This book constitutes the proceedings of the 23rd International Symposium on Foundations of Intelligent Systems, ISMIS 2017, held in Warsaw, Poland, in June 2017. The 56 regular and 15 short papers presented in this volume were carefully reviewed and selected from 118 submissions. The papers include both theoretical and practical aspects of machine learning, data mining methods, deep learning, bioinformatics and health informatics, intelligent information systems, knowledge-based systems, mining temporal, spatial and spatio-temporal data, text and Web mining. In addition, four special sessions were organized; namely, Special Session on Big Data Analytics and Stream Data Mining, Special Session on Granular and Soft Clustering for Data Science, Special Session on Knowledge Discovery with Formal Concept Analysis and Related Formalisms, and Special Session devoted to ISMIS 2017 Data Mining Competition on Trading Based on Recommendations, which was launched as a part of the conference.
A market graph is built on the basis of some similarity measure for financial asset returns. The paper considers two similarity measures: the classic Pearson correlation and the sign correlation. We study the associated market graphs and compare the conditional risk of market graph construction for these two measures. Our main finding is that, for several probabilistic models, the conditional risk for the sign correlation is much better than for the Pearson correlation at larger threshold values. In addition, we show that for some models the conditional risk for the sign correlation dominates the conditional risk for the Pearson correlation for all threshold values. These properties make the sign correlation a more appropriate measure for maximum clique analysis.
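As a rough illustration of how a market graph depends on the chosen similarity measure, the sketch below builds the graph by thresholding pairwise similarities. It rests on assumptions not stated in the abstract: the sign measure is taken to be the fraction of observations in which two assets deviate from their sample means in the same direction, and an edge is added when the similarity exceeds the threshold. The function names and the toy return series are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Sample Pearson correlation of two equally long return series."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def sign_similarity(x, y):
    """Fraction of observations where both series deviate from their
    means in the same direction (assumed definition of the sign measure)."""
    mx, my = mean(x), mean(y)
    agree = sum(1 for a, b in zip(x, y) if (a - mx) * (b - my) > 0)
    return agree / len(x)


def market_graph(returns, measure, threshold):
    """Edges (u, v) between assets whose similarity exceeds the threshold."""
    names = sorted(returns)
    edges = set()
    for i, u in enumerate(names):
        for v in names[i + 1:]:
            if measure(returns[u], returns[v]) > threshold:
                edges.add((u, v))
    return edges


# Toy data: B moves with A, C moves against both.
returns = {"A": [1, 2, 3, 4], "B": [2, 4, 6, 8], "C": [4, 3, 2, 1]}
pearson_graph = market_graph(returns, pearson, threshold=0.5)
sign_graph = market_graph(returns, sign_similarity, threshold=0.5)
```

Note that the two measures live on different scales (Pearson in [-1, 1], the sign measure in [0, 1]), so thresholds are not directly comparable between them.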
The recently proposed distance dependent Chinese Restaurant Process (ddCRP) generalizes the extensively used Chinese Restaurant Process (CRP) by accounting for dependencies between data points. Its posterior is intractable, and so far only MCMC methods have been used for inference. Because of the very different nature of the ddCRP, no prior developments in variational methods for Bayesian nonparametrics are applicable. In this paper we propose a novel variational inference method for the important sequential case of the ddCRP (seqddCRP) by revealing its connection with the Laplacian of the random graph constructed by the process. We develop an efficient algorithm for optimizing the variational lower bound and demonstrate its efficiency compared to a Gibbs sampler. We also apply our variational approximation to the CRP-equivalent seqddCRP mixture model, where it can be considered an alternative to the one based on the truncated stick-breaking representation. This allowed us to achieve a significantly better variational lower bound than the variational approximation based on truncated stick-breaking for the Dirichlet process.
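To make the random graph behind the sequential ddCRP concrete, here is a minimal sketch of sampling from the prior only; it does not implement the variational inference proposed in the paper. It assumes the standard ddCRP construction in which customer i links to an earlier customer j with weight given by a decay function of their distance, or to itself with weight alpha, and tables are the connected components of the link graph. The names `sample_seq_ddcrp`, `times`, `alpha` and `decay` are illustrative.

```python
import random


def sample_seq_ddcrp(times, alpha, decay, rng):
    """Draw customer links and table labels from a sequential ddCRP prior.

    times -- ordered positions of the customers (e.g. time stamps)
    alpha -- weight of a self-link, which opens a new table (must be > 0)
    decay -- function mapping a non-negative distance to a link weight
    rng   -- random.Random instance, passed in for reproducibility
    """
    links = []
    for i in range(len(times)):
        # Weights for linking to each earlier customer, then to oneself.
        weights = [decay(times[i] - times[j]) for j in range(i)] + [alpha]
        links.append(rng.choices(range(i + 1), weights=weights)[0])

    # Links only point backwards, so labels resolve in a single pass.
    tables, next_table = [], 0
    for i, j in enumerate(links):
        if j == i:
            tables.append(next_table)  # self-link: new table
            next_table += 1
        else:
            tables.append(tables[j])  # join the linked customer's table
    return links, tables


rng = random.Random(0)
times = list(range(10))
# A constant decay makes every earlier customer equally attractive.
links, tables = sample_seq_ddcrp(times, alpha=1.0, decay=lambda d: 1.0, rng=rng)
```

With a constant decay, a customer joins an existing table with probability proportional to its current size, which matches the usual CRP seating rule; other decay functions favor nearby customers.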
This article presents a new technique for collaborative filtering based on pre-clustering of website usage data. The key idea is to use clustering methods to define groups of different users.
This is a textbook in data analysis. Its contents are heavily influenced by the idea that data analysis should help in enhancing and augmenting knowledge of the domain as represented by the concepts and statements of relation between them. According to this view, two main pathways for data analysis are summarization, for developing and augmenting concepts, and correlation, for enhancing and establishing relations. Visualization, in this context, is a way of presenting results in a cognitively comfortable way. The term summarization is understood quite broadly here to embrace not only simple summaries like totals and means, but also more complex summaries such as the principal components of a set of features or cluster structures in a set of entities.
The material presented in this perspective makes a unique mix of subjects from the fields of statistical data analysis, data mining, and computational intelligence, which follow different systems of presentation.
A vast number of documents on the Web have duplicates, which poses a challenge for developing efficient methods that compute clusters of similar documents. In this paper we use an approach based on computing (closed) sets of attributes having large support (large extent) as clusters of similar documents. The method is tested in a series of computer experiments on large public collections of web documents and compared to other established methods and software, such as biclustering, on the same datasets. The practical efficiency of different algorithms for computing frequent closed sets of attributes is compared.
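The notion of a closed attribute set with large support can be sketched as follows. This is a naive illustration, not one of the algorithms compared in the paper: it assumes each document is represented by a set of attributes (for example, shingles), enumerates closed sets as intersections of document attribute sets, and keeps those shared by at least `min_support` documents. The function name and the toy collection are hypothetical, and the enumeration can be exponential in the worst case.

```python
def frequent_closed_sets(docs, min_support):
    """Map each closed attribute set with large support to its extent.

    docs        -- dict: document id -> iterable of attributes (assumed input)
    min_support -- minimum number of documents sharing the attribute set
    """
    # Closed sets are exactly the intersections of document attribute sets.
    closed = set()
    for attrs in docs.values():
        attrs = frozenset(attrs)
        closed |= {attrs} | {attrs & c for c in closed}

    result = {}
    for c in closed:
        if not c:
            continue  # the empty set carries no similarity information
        # Extent: all documents that contain every attribute in c.
        extent = frozenset(d for d, a in docs.items() if c <= set(a))
        if len(extent) >= min_support:
            result[c] = extent
    return result


docs = {
    "d1": {"a", "b", "c"},
    "d2": {"a", "b", "d"},
    "d3": {"a", "b", "c"},
    "d4": {"x"},
}
clusters = frequent_closed_sets(docs, min_support=2)
```

Here `d1` and `d3` form a cluster of exact duplicates sharing `{a, b, c}`, and `d1`, `d2`, `d3` form a looser cluster sharing `{a, b}`.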
In this work, we study the optimal risk sharing problem between an insurer and a reinsurer in a dynamical insurance model known as the Cramér–Lundberg risk process, which, unlike known models, describes not per-claim reinsurance but periodic reinsurance of damages over a given time interval. We take into account a natural upper bound on the risk taken by the reinsurer. We solve optimal control problems on an infinite time interval for mean-variance optimality criteria: a linear utility functional and a stationary variation coefficient. We show that optimal reinsurance belongs to the class of total risk reinsurances. We establish that the most profitable reinsurance is the stop-loss reinsurance with an upper limit. We find equations for the values of parameters in optimal reinsurance strategies.
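For orientation, a stop-loss contract with an upper limit can be written as a simple payout rule. The sketch below assumes the common form in which the reinsurer covers the part of the total damage over a period that exceeds a retention level, capped at a limit; the parameter names `retention` and `limit` are illustrative, and the optimal values derived in the paper are not reproduced here.

```python
def stop_loss_with_limit(total_damage, retention, limit):
    """Split the total damage of one period between insurer and reinsurer.

    The reinsurer pays the excess over `retention`, but never more than
    `limit` (assumed contract form; both parameters are hypothetical).
    Returns (insurer_share, reinsurer_share).
    """
    reinsurer_share = min(max(total_damage - retention, 0), limit)
    insurer_share = total_damage - reinsurer_share
    return insurer_share, reinsurer_share


# Below the retention the insurer bears everything; above retention + limit
# the excess beyond the cap falls back on the insurer.
small = stop_loss_with_limit(50, retention=100, limit=200)
medium = stop_loss_with_limit(150, retention=100, limit=200)
large = stop_loss_with_limit(500, retention=100, limit=200)
```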
This proceedings publication is a compilation of selected contributions from the “Third International Conference on the Dynamics of Information Systems”, which took place at the University of Florida, Gainesville, February 16–18, 2011. The purpose of this conference was to bring together scientists and engineers from industry, government, and academia in order to exchange new discoveries and results in a broad range of topics relevant to the theory and practice of dynamics of information systems. Dynamics of Information Systems: Mathematical Foundation presents state-of-the-art research and is intended for graduate students and researchers interested in some of the most recent discoveries in information theory and dynamical systems. Scientists in other disciplines may also benefit from the applications of new developments to their own area of study.