Persistent Homology-based Projection Pursuit
Dimensionality reduction problem is stated as finding a mapping f:X ∈ R m → Z ∈ R n , where ≪ m while preserving some relevant properties of the data. We formulate topology-preserving dimensionality reduction as finding the optimal orthogonal projection to the lower-dimensional subspace which minimizes discrepancy between persistent diagrams of the original data and the projection. This generalizes the classic projection pursuit algorithm which was originally designed to preserve the number of clusters, i.e. the 0-order topological invariant of the data. Our approach further allows to preserve k-th order invariants within the principled framework. We further pose the resulting optimization problem as the Riemannian optimization problem which allows for a natural and efficient solution.
In this paper we propose a method of movie series recommender system development. Our recommender system is content-based, and movie series are represented by their scripts. We experiment with several semantic similarity measures, lexico-morphological metrics, keywords and vector space models to extract similar movie series. Evaluation is conducted in the experiment with informants. The best results are achieved by distributional semantic approach (i.e., using word2vec technology).
Many Data Mining tasks deal with data which are presented in high dimensional spaces, and the ‘curse of dimensionality’ phenomena is often an obstacle to the use of many methods for solving these tasks. To avoid these phenomena, various Representation learning algorithms are used as a first key step in solutions of these tasks to transform the original high-dimensional data into their lower-dimensional representations so that as much information about the original data required for the considered Data Mining task is preserved as possible. The above Representation learning problems are formulated as various Dimensionality Reduction problems (Sample Embedding, Data Manifold embedding, Manifold Learning and newly proposed Tangent Bundle Manifold Learning) which are motivated by various Data Mining tasks. A new geometrically motivated algorithm that solves the Tangent Bundle Manifold Learning and gives new solutions for all the considered Dimensionality Reduction problems is presented.
This volume contains the extended version of selected talks given at the international research workshop "Coping with Complexity: Model Reduction and Data Analysis", Ambleside, UK, August 31 – September 4, 2009. The book is deliberately broad in scope and aims at promoting new ideas and methodological perspectives. The topics of the chapters range from theoretical analysis of complex and multiscale mathematical models to applications in e.g., fluid dynamics and chemical kinetics.
The problem of community detection in a network with features at its nodes takes into account both the graph structure and node features. The goal is to find relatively dense groups of interconnected entities sharing some features in common. We apply the so-called data recovery approach to the problem by combining the least-squares recovery criteria for both, the graph structure and node features. In this way, we obtain a new clustering criterion and a corresponding algorithm for finding clusters/communities one-by-one. We show that our proposed method is effective on real-world data, as well as on synthetic data involving either only quantitative features or only categorical attributes or both. Our algorithm appears competitive against state-of-the-art algorithms.
In this article, our ultimate goal is to transform a graph’s adjacency matrix into a distance matrix. Because cluster density is not observable prior to the actual clustering, our goal is to find a distance whose pairwise minimisation will lead to densely connected clusters. Our thesis is centred on the widely accepted notion that strong clusters are sets of vertices with high induced subgraph density. We posit that vertices sharing more connections are closer to each other than vertices sharing fewer connections. This definition of distance differs from the usual shortest-path distance. At the cluster level, our thesis translates into low mean intra-cluster distances, which reflect high densities. We compare three distance measures from the literature. Our benchmark is the accuracy of each measure’s reflection of intra-cluster density, when aggregated (averaged) at the cluster level. We conduct our tests on synthetic graphs, where clusters and intra-cluster density are known in advance. In this article, we restrict our attention to unweighted graphs with no self-loops or multiple edges. We examine the relationship between mean intra-cluster distances and intra-cluster densities. Our numerical experiments show that Jaccard and Otsuka-Ochiai offer very accurate measures of density, when averaged over vertex pairs within clusters.
Neuronal oscillations have been shown to be associated with perceptual, motor and cognitive brain operations. While complex spatio-temporal dynamics are a hallmark of neuronal oscillations, they also represent a formidable challenge for the proper extraction and quantification of oscillatory activity with non-invasive recording techniques such as EEG and MEG. In order to facilitate the study of neuronal oscillations we present a general-purpose pre-processing approach, which can be applied for a wide range of analyses including but not restricted to inverse modeling and multivariate single-trial classification. The idea is to use dimensionality reduction with spatio-spectral decomposition (SSD) instead of the commonly and almost exclusively used principal component analysis (PCA). The key advantage of SSD lies in selecting components explaining oscillations-related variance instead of just any variance as in the case of PCA. For the validation of SSD pre-processing we performed extensive simulations with different inverse modeling algorithms and signal-to-noise ratios. In all these simulations SSD invariably outperformed PCA often by a large margin. Moreover, using a database of multichannel EEG recordings from 80 subjects we show that pre-processing with SSD significantly increases the performance of single-trial classification of imagined movements, compared to the classification with PCA pre-processing or without any dimensionality reduction. Our simulations and analysis of real EEG experiments show that, while not being supervised, the SSD algorithm is capable of extracting components primarily relating to the signal of interest often using as little as 20% of the data variance, instead of > 90% variance as in case of PCA. Given its ease of use, absence of supervision, and capability to efficiently reduce the dimensionality of multivariate EEG/MEG data, we advocate the application of SSD pre-processing for the analysis of spontaneous and induced neuronal oscillations in normal subjects and patients.
The paper presents a new geometrically motivated method for non-linear regression based on Manifold learning technique. The regression problem is to construct a predictive function which estimates an unknown smooth mapping f from q-dimensional inputs to m-dimensional outputs based on a training data set consisting of given ‘input-output’ pairs. The unknown mapping f determines q-dimensional manifold M(f) consisting of all the ‘input-output’ vectors which is embedded in (q+m)-dimensional space and covered by a single chart; the training data set determines a sample from this manifold. Modern Manifold Learning methods allow constructing the certain estimator M* from the manifold-valued sample which accurately approximates the manifold. The proposed method called Manifold Learning Regression (MLR) finds the predictive function fMLR to ensure an equality M(fMLR) = M*. The MLR simultaneously estimates the m×q Jacobian matrix of the mapping f.
MARAMI 2020 Modèles & Analyse des Réseaux : Approches Mathématiques & Informatiques - Network Modeling and Analysis 2020
Proceedings of MARAMI 2020 - Modèles & Analyse des Réseaux : Approches Mathématiques & Informatiques - The 11th Conference on Network Modeling and Analysis
Virtual Conference, October 14-15, 2020.
CIRAD, UMR Tetis, Montpellier, France TETIS, Univ. Montpellier, AgroParisTech, CIRAD, CNRS, INRAE, Montpellier, France
A model for organizing cargo transportation between two node stations connected by a railway line which contains a certain number of intermediate stations is considered. The movement of cargo is in one direction. Such a situation may occur, for example, if one of the node stations is located in a region which produce raw material for manufacturing industry located in another region, and there is another node station. The organization of freight traﬃc is performed by means of a number of technologies. These technologies determine the rules for taking on cargo at the initial node station, the rules of interaction between neighboring stations, as well as the rule of distribution of cargo to the ﬁnal node stations. The process of cargo transportation is followed by the set rule of control. For such a model, one must determine possible modes of cargo transportation and describe their properties. This model is described by a ﬁnite-dimensional system of diﬀerential equations with nonlocal linear restrictions. The class of the solution satisfying nonlocal linear restrictions is extremely narrow. It results in the need for the “correct” extension of solutions of a system of diﬀerential equations to a class of quasi-solutions having the distinctive feature of gaps in a countable number of points. It was possible numerically using the Runge–Kutta method of the fourth order to build these quasi-solutions and determine their rate of growth. Let us note that in the technical plan the main complexity consisted in obtaining quasi-solutions satisfying the nonlocal linear restrictions. Furthermore, we investigated the dependence of quasi-solutions and, in particular, sizes of gaps (jumps) of solutions on a number of parameters of the model characterizing a rule of control, technologies for transportation of cargo and intensity of giving of cargo on a node station.
Event logs collected by modern information and technical systems usually contain enough data for automated process models discovery. A variety of algorithms was developed for process models discovery, conformance checking, log to model alignment, comparison of process models, etc., nevertheless a quick analysis of ad-hoc selected parts of a journal still have not get a full-fledged implementation. This paper describes an ROLAP-based method of multidimensional event logs storage for process mining. The result of the analysis of the journal is visualized as directed graph representing the union of all possible event sequences, ranked by their occurrence probability. Our implementation allows the analyst to discover process models for sublogs defined by ad-hoc selection of criteria and value of occurrence probability
Existing approaches suggest that IT strategy should be a reflection of business strategy. However, actually organisations do not often follow business strategy even if it is formally declared. In these conditions, IT strategy can be viewed not as a plan, but as an organisational shared view on the role of information systems. This approach generally reflects only a top-down perspective of IT strategy. So, it can be supplemented by a strategic behaviour pattern (i.e., more or less standard response to a changes that is formed as result of previous experience) to implement bottom-up approach. Two components that can help to establish effective reaction regarding new initiatives in IT are proposed here: model of IT-related decision making, and efficiency measurement metric to estimate maturity of business processes and appropriate IT. Usage of proposed tools is demonstrated in practical cases.