### Book chapter

## Consistency of Domain Estimation by the Grassmann-Stiefel Eigenmaps Algorithm

### In book

Many Data Mining tasks deal with data presented in high-dimensional spaces, and the ‘curse of dimensionality’ phenomenon is often an obstacle to the use of many methods for solving these tasks. To avoid this phenomenon, various Representation learning algorithms are used as a first key step in solving these tasks: they transform the original high-dimensional data into lower-dimensional representations that preserve as much as possible of the information about the original data required for the considered Data Mining task. These Representation learning problems are formulated as various Dimensionality Reduction problems (Sample Embedding, Data Manifold Embedding, Manifold Learning and the newly proposed Tangent Bundle Manifold Learning), which are motivated by various Data Mining tasks. A new geometrically motivated algorithm is presented that solves the Tangent Bundle Manifold Learning problem and gives new solutions for all the considered Dimensionality Reduction problems.

Let X be an unknown nonlinear smooth q-dimensional Data manifold (D-manifold) embedded in a p-dimensional space (p > q) and covered by a single coordinate chart. It is assumed that the manifold's condition number is positive, so X has no self-intersections. Let X_n = {X_1, X_2, ..., X_n} ⊂ X ⊂ R^p be a sample of points drawn from the D-manifold X independently of each other according to an unknown probability measure on X with strictly positive density.

In many applications, real high-dimensional data occupy only a very small part of the high-dimensional ‘observation space’ and have a small intrinsic dimension. The most popular model of such data is the Manifold model, which assumes that the data lie on or near an unknown manifold (Data Manifold, DM) of lower dimensionality embedded in an ambient high-dimensional input space (the Manifold assumption about high-dimensional data). Manifold Learning is a Dimensionality Reduction problem under the Manifold assumption about the processed data; its goal is to construct a low-dimensional parameterization of the DM (global low-dimensional coordinates on the DM) from a finite dataset sampled from the DM. The Manifold assumption means that a local neighborhood of each manifold point is equivalent to a region of low-dimensional Euclidean space. Because of this, most Manifold Learning algorithms include two parts: a ‘local part’, in which certain characteristics reflecting the low-dimensional local structure of the neighborhoods of all sample points are constructed, and a ‘global part’, in which global low-dimensional coordinates on the DM are constructed by solving a certain convex optimization problem for a specific cost function that depends on the local characteristics. The statistical properties of the ‘local part’ are closely connected with local sampling on the manifold, which is considered in the study.
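As an illustration of what a ‘local part’ can compute, the sketch below estimates the tangent direction at one sample point by applying PCA to its nearest neighbours. This is a generic local-PCA construction chosen for illustration, not necessarily the local characteristic used in the study; the function name `local_tangent_basis`, the neighbourhood size `k` and the unit-circle test data are all assumptions.

```python
import numpy as np

def local_tangent_basis(points, index, k, q):
    """Estimate an orthonormal basis (p x q) of the tangent space at points[index]
    via PCA on its k nearest neighbours (an assumed, generic 'local part')."""
    x0 = points[index]
    # Euclidean distances from the chosen point to every sample point.
    dists = np.linalg.norm(points - x0, axis=1)
    # The nearest entry is the point itself (distance 0), so skip it.
    neighbours = np.argsort(dists)[1:k + 1]
    # Shift the neighbourhood so that the chosen point is the origin.
    centred = points[neighbours] - x0
    # Right singular vectors are the principal directions, strongest first.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return vt[:q].T

# Hypothetical data: a random sample from the unit circle (q = 1, p = 2).
rng = np.random.default_rng(0)
angles = rng.uniform(0.0, 2.0 * np.pi, 2000)
circle = np.column_stack([np.cos(angles), np.sin(angles)])

basis = local_tangent_basis(circle, index=0, k=10, q=1)
# On a circle the tangent is orthogonal to the radius, so this is close to 0.
radial_component = abs(float(basis[:, 0] @ circle[0]))
```

The accuracy of such an estimate depends on how many sample points fall into the neighbourhood and how they are distributed there, which is why the local sampling behaviour matters for the statistical properties of the ‘local part’.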

The paper presents a new geometrically motivated method for non-linear regression based on the Manifold Learning technique. The regression problem is to construct a predictive function that estimates an unknown smooth mapping f from q-dimensional inputs to m-dimensional outputs, based on a training data set consisting of given ‘input-output’ pairs. The unknown mapping f determines a q-dimensional manifold M(f), consisting of all the ‘input-output’ vectors, which is embedded in (q+m)-dimensional space and covered by a single chart; the training data set determines a sample from this manifold. Modern Manifold Learning methods allow constructing an estimator M* from the manifold-valued sample that accurately approximates the manifold. The proposed method, called Manifold Learning Regression (MLR), finds the predictive function f_MLR that ensures the equality M(f_MLR) = M*. The MLR simultaneously estimates the m×q Jacobian matrix of the mapping f.
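The link between the manifold M(f) and the Jacobian can be made concrete. At the point (x, f(x)) the tangent space of M(f) is spanned by the columns of the stacked matrix [I; J], where J is the m×q Jacobian of f at x, so any basis H of that tangent space, split into its first q rows H_in and its last m rows H_out, satisfies J = H_out · H_in⁻¹. The sketch below checks this on a toy example; it is not the MLR procedure itself, and the function name `jacobian_from_tangent`, the choice f(x) = sin(x) and the local-PCA tangent estimate are assumptions made for illustration.

```python
import numpy as np

def jacobian_from_tangent(basis, q):
    """Recover the m x q Jacobian from a (q+m) x q tangent basis of the graph manifold."""
    h_in, h_out = basis[:q], basis[q:]
    # Any basis equals [I; J] @ A for some invertible A, hence J = h_out @ inv(h_in).
    return h_out @ np.linalg.inv(h_in)

# Hypothetical training set: 'input-output' pairs on the graph of f(x) = sin(x).
x = np.linspace(0.0, 3.0, 3001)
graph = np.column_stack([x, np.sin(x)])   # sample from M(f), q = 1, m = 1

i0 = 1000                                 # the point with input x = 1.0
# Local PCA on a small symmetric window around the point (assumed tangent estimator).
window = graph[i0 - 5:i0 + 6] - graph[i0]
_, _, vt = np.linalg.svd(window, full_matrices=False)
basis = vt[:1].T                          # (q+m) x q tangent basis

jac = jacobian_from_tangent(basis, q=1)   # 1 x 1 matrix close to cos(1.0)
```

The inverse exists whenever the estimated tangent space is not orthogonal to the input coordinates, which holds for the graph of any smooth f when the tangent estimate is accurate.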

In many Data Analysis tasks, one deals with data that are presented in high-dimensional spaces. In practice, the original high-dimensional data are transformed into lower-dimensional representations (features) that preserve certain subject-driven data properties such as distances or geodesic distances, angles, etc. Preserving as much as possible of the available information contained in the original high-dimensional data is also an important and desirable property of the representation. Real-world high-dimensional data typically lie on or near a certain unknown low-dimensional manifold (Data manifold) embedded in an ambient high-dimensional ‘observation’ space, so in this article we assume that this Manifold assumption holds. An exact isometric manifold embedding in a low-dimensional space is possible only in certain special cases, so we consider the problem of constructing a ‘locally isometric and conformal’ embedding, which preserves distances and angles between close points. We propose a new geometrically motivated locally isometric and conformal representation method, which employs the Tangent Manifold Learning technique, consisting of sample-based estimation of the tangent spaces to the unknown Data manifold. In numerical experiments, the proposed method compares favourably with popular Manifold Learning methods in terms of isometric and conformal embedding properties as well as the accuracy of Data manifold reconstruction from the sample.

One of the ultimate goals of Manifold Learning (ML) is to reconstruct an unknown nonlinear low-dimensional Data Manifold (DM) embedded in a high-dimensional observation space from a given set of data points sampled from the manifold. We derive an asymptotic expansion and local lower and upper bounds for the maximum reconstruction error in a small neighborhood of an arbitrary point. The expansion and bounds are defined in terms of the distance between the tangent spaces to the original DM and to the Reconstructed Manifold (RM) at the selected point and at its reconstructed value, respectively. We propose an amplification of the ML, called Tangent Bundle ML, in which proximity is required not only between the DM and the RM but also between their tangent spaces. We present a new geometrically motivated Grassmann & Stiefel Eigenmaps algorithm that solves this problem and also gives a new solution to the ML problem.
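To make the notion of a distance between tangent spaces concrete, the sketch below uses the projection 2-norm: the spectral norm of the difference between the orthogonal projectors onto the two subspaces, which for subspaces of equal dimension equals the sine of the largest principal angle between them. The abstract does not state which metric the bounds use, so this choice, and the names `projector` and `tangent_space_distance`, are assumptions.

```python
import numpy as np

def projector(basis):
    """Orthogonal projector onto the column span of `basis` (p x q, full column rank)."""
    q_mat, _ = np.linalg.qr(basis)        # orthonormalise the columns
    return q_mat @ q_mat.T

def tangent_space_distance(basis_a, basis_b):
    """Projection 2-norm distance between two subspaces of equal dimension."""
    return float(np.linalg.norm(projector(basis_a) - projector(basis_b), ord=2))

# Hypothetical example: two lines in the plane that meet at 30 degrees.
theta = np.pi / 6
line_a = np.array([[1.0], [0.0]])
line_b = np.array([[np.cos(theta)], [np.sin(theta)]])

dist = tangent_space_distance(line_a, line_b)   # equals sin(30 degrees)
```

This distance is 0 when the two tangent spaces coincide and 1 when one of them contains a direction orthogonal to the other, so it gives a bounded measure of how far a reconstructed tangent space is from the true one.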