Of all publications in the section: 21
Article
K. V. Vorontsov. Journal of machine learning and data analysis. 2016. Vol. 2. No. 2. P. 187-200.

Probabilistic topic models uncover the latent semantics of text collections and represent each document by a multinomial distribution over topics. Hierarchical models divide topics into subtopics recursively, thus simplifying information retrieval, browsing, and understanding of large multidisciplinary collections. Most existing approaches to hierarchy learning rely on Bayesian inference, which makes it difficult to incorporate topical hierarchies into other types of topic models. The authors use a non-Bayesian multicriteria approach called Additive Regularization of Topic Models (ARTM), which makes it possible to combine any topic models formalized via log-likelihood maximization with additive regularization criteria. In this work, such a formalization is proposed for topical hierarchies. Hence, the hierarchical ARTM (hARTM) can be easily adapted to a wide class of text mining problems, e.g., learning topical hierarchies from multimodal and multilingual heterogeneous data of scientific digital libraries or social media. The authors focus on topical hierarchies that allow a topic to have several parent topics, which is important for multidisciplinary collections of scientific papers. The regularization approach allows one to control the sparsity of the parent–child relation and automatically determine the number of subtopics for each topic. Before learning the hierarchy, it is necessary to fix the number of topics for each layer. Additive regularization does not complicate the learning algorithm, so this approach scales well to large text collections.

Added: Oct 19, 2017
Article
N. A. Chirkova, K. V. Vorontsov. Journal of machine learning and data analysis. 2016. Vol. 2. No. 2. P. 187-200.

Added: Oct 19, 2017
Article
Akopov A. S., Beklaryan A., Beklaryan L. A. et al. Journal of machine learning and data analysis. 2016. Vol. 2. No. 1. P. 104-115.

The article considers topical problems of modeling ecological-economic systems, using the Republic of Armenia (RA) as an example. Based on agent-based modeling and system dynamics methods, a simulation model of the ecological-economic system was created, which made it possible to construct the RA Ecological Map. An important purpose of the proposed approach is to search for scenarios of rational modernization of the agent-enterprises that are the main sources of emissions, while simultaneously determining an effective strategy of government regulation. The bi-criteria optimization problem for the ecological-economic system of RA is formulated and solved with the help of the developed genetic algorithm.

Added: Aug 23, 2016
Article
Izmailov P., Kropotov D. Journal of machine learning and data analysis. 2017. Vol. 3. No. 1. P. 20-35.

Background: Gaussian processes (GP) provide an elegant and effective approach to learning in kernel machines. This approach leads to a highly interpretable model and allows using the Bayesian framework for model adaptation and for incorporating prior knowledge about the problem. The GP framework has been successfully applied to regression, classification, and dimensionality reduction problems. Unfortunately, the standard methods for both GP regression and GP classification scale as $O(n^3)$, where $n$ is the size of the dataset, which makes them inapplicable to big data problems. A variety of methods have been proposed to overcome this limitation for both regression and classification problems. The most successful recent methods are based on the concept of inducing inputs. These methods reduce the computational complexity to $O(nm^2)$, where $m$ is the number of inducing inputs, with $m$ typically much less than $n$. The present authors focus on classification. The current state-of-the-art method for this problem is based on stochastic optimization of an evidence lower bound (ELBO) that depends on $O(m^2)$ parameters. For complex problems, the required number of inducing points $m$ is fairly large, making the optimization in this method challenging.

Methods: The structure of the variational lower bound that appears in inducing input GP classification has been analyzed. First, it has been noted that by using a quadratic approximation of several terms in this bound, it is possible to obtain analytical expressions for the optimal values of most of the optimization parameters, thus significantly reducing the dimension of the optimization space. Then, two methods have been provided for constructing the necessary quadratic approximations: one is based on the Jaakkola–Jordan bound for the logistic function and the other is derived using a Taylor expansion.

Results: Two new variational lower bounds have been proposed for inducing input GP classification that depend on a number of parameters. Then, several methods have been suggested for optimizing these bounds, and the resulting algorithms have been compared with the state-of-the-art approach based on stochastic optimization. Experiments on several classification datasets show that the new methods perform as well as or better than the existing one. Moreover, the new methods do not require any tunable parameters and work across a wide range of $n$ and $m$ values, thus significantly simplifying the training of GP classification models.

Added: Dec 6, 2018
Article
Beklaryan L. A., Beklaryan A. Journal of machine learning and data analysis. 2018. Vol. 4. No. 4. P. 220-234.

The problem of the existence of soliton solutions (solutions of the traveling wave type) for the Korteweg-de Vries equation with a polynomial potential is considered. The approach is based on demonstrating a one-to-one correspondence between such solutions and the solutions of an induced functional differential equation of pointwise type. Along the way, conditions arise for the existence and uniqueness of traveling wave solutions with growth restrictions in both time and space. Importantly, the conditions for the existence of a traveling wave solution are formulated in terms of the right-hand side of the equation and the characteristics of the traveling wave, without using either the linearization or the spectral properties of the corresponding equation in variations. Conditions for the existence of periodic soliton solutions are considered separately, and the possibility of passing from systems with a quasilinear potential to systems with a polynomial potential while preserving the corresponding existence theorems is demonstrated. A numerical implementation of such solutions is given.

Added: Jan 11, 2019
Article
Belomestny D., Panov V., Spokoiny V. Journal of machine learning and data analysis. 2012. Vol. 1. No. 3. P. 140-147.

Let a high-dimensional random vector $\vX$ be represented as a sum of two components: a signal $\vS$ that belongs to some low-dimensional linear subspace $\S$, and a noise component $\vN$. This paper presents a new approach for estimating the subspace $\S$ based on the ideas of Non-Gaussian Component Analysis. Our approach avoids the technical difficulties that usually appear in similar methods: it requires neither the estimation of the inverse covariance matrix of $\vX$ nor the estimation of the covariance matrix of $\vN$.

Added: Sep 23, 2013
Article
Panov A. I. Journal of machine learning and data analysis. 2014. Vol. 1. No. 7. P. 863-874.
Added: Oct 12, 2015
Article
K.V. Vorontsov, Sokolov E., Frey A. Journal of machine learning and data analysis. 2013. Vol. 1. No. 6. P. 734-743.

Computable combinatorial data-dependent generalization bounds are studied. This approach is based on simplified probabilistic assumptions: it is assumed that the instance space is finite, the labeling function is deterministic, and the loss function is binary. A random walk across a set of linear classifiers with a low error rate is used to compute the bound efficiently. Experimental evidence is provided to confirm that this approach leads to practical overfitting bounds in classification tasks.

Added: May 6, 2014
Article
Khachatryan N. K., Beklaryan L. A. Journal of machine learning and data analysis. 2015. Vol. 1. No. 13. P. 1815-1826.
Added: Jul 2, 2016
Article
Vorontsov K. V., Potapenko A. A. Journal of machine learning and data analysis. 2013. Vol. 1. No. 6. P. 657-686.

Probabilistic topic models discover a low-dimensional interpretable representation of text corpora by estimating a multinomial distribution over topics for each document and a multinomial distribution over terms for each topic. A unified family of expectation-maximization (EM) like algorithms with smoothing, sampling, sparsing, and robustness heuristics that can be used in any combination is considered. The known models PLSA (probabilistic latent semantic analysis), LDA (latent Dirichlet allocation), and SWB (special words with background), as well as new ones, can be considered special cases of the presented broad family of models. A new simple robust algorithm is proposed that is suitable for sparse models and does not require estimating and storing a big matrix of noise parameters. The present authors find experimentally optimal combinations of heuristics with sparsing strategies and discover that a sparse robust model without Dirichlet smoothing performs very well and gives more than 99% zeros in the multinomial distributions without loss of perplexity.

Added: Feb 19, 2015
Article
K.V. Vorontsov, Potapenko A. Journal of machine learning and data analysis. 2013. Vol. 1. No. 6. P. 657-686.

Added: May 6, 2014
Article
Ryabenko E. A. Journal of machine learning and data analysis. 2014. Vol. 1. No. 7. P. 800-816.
Added: Oct 14, 2016
Article
Khusainov F. I., Valkov A. S., Kozhanov E. M. et al. Journal of machine learning and data analysis. 2012. Vol. 1. No. 4. P. 448-465.

The authors propose a method for non-parametric forecasting of railroad station occupancy from historical data. The algorithm is based on the convolution of the empirical distribution density of the time series values with the loss function. The features of the autoregressive forecasting model are investigated. The algorithm is illustrated on railroad station occupancy data for the Omsk region in 2007 and 2008.

Added: Mar 7, 2019
Article
Filipenkov N. V., Petrova M. A. Journal of machine learning and data analysis. 2014. Vol. 1. No. 9. P. 1215-1231.
Added: Dec 8, 2018
Article
Chernousov V. O., Savchenko A. V. Journal of machine learning and data analysis. 2014. Vol. 1. No. 10. P. 1369-1381.

Background: The problem of video-based detection of a moving forklift truck is explored. It is shown that the detection quality of state-of-the-art local descriptors (SURF, SIFT, FAST, ORB) is not satisfactory if the resolution is low and the lighting changes dramatically.

Methods: In this paper we propose to use a simple mathematical morphology algorithm to detect the presence of cargo on the forklift truck. First, the movement direction is estimated by updating the motion history image, and the front part of the moving object is obtained. Next, contours are detected, and binary morphological operations in front of the moving object are used to estimate simple geometric features of an empty forklift.

Results: Our experimental study shows that the best results are achieved if the bounding rectangles of empty forklift contours are used as the object validation rule. Namely, the false acceptance rate (FAR) and false rejection rate (FRR) of empty cargo detection are 7% and 50% lower than the FAR and FRR of the FAST descriptor. The proposed method is much more resistant to the effect of additive noise. The average frame processing time for our morphological algorithm is 5 ms (compared with 35 ms for the FAST method).

Conclusions: The proposed morphological method is task specific and can be used only for forklift truck detection. Additional detection principles need to be added to adapt the algorithm to the detection of other moving objects in a noisy environment.

Added: Feb 26, 2015
Article
Khusainov F. I., Valkov A. S., Kozhanov E. M. et al. Journal of machine learning and data analysis. 2013. Vol. 1. No. 5. P. 503-516.

The problem of detecting causal relationships between time series is studied. The authors propose a forecasting model that takes the detected relationships into account. The model is intended to forecast the utilization of a railway junction station. It relies on the history of the junction station's utilization as well as on the time series for the main financial instruments and regulations. Expert assessments are used to construct the model, and a method for evaluating the plausibility of these assessments is proposed. The method is illustrated with Russian Railways data.

Added: Mar 3, 2019
Article
Voznesenskaya T. V., Lednov D. A. Journal of machine learning and data analysis. 2018. Vol. 4. No. 4. P. 266-279.

This paper describes the system of automatic text summarization developed by the «DC – Systems» company in cooperation with the faculty of computer science at HSE. A summary is a concise description of the text in terms of its content and meaning, i.e. from the point of view of its semantics. The purpose of summarization is to shorten the text as much as possible while preserving the main content. A summary in this article is built using syntactically correlated word combinations. In this case, the possible additional meanings of separate fragments of the text are neglected. The quality of the summary is evaluated by how well it matches the source text in terms of semantics.

The main problem is split into two parts: an evaluation of the semantics of the whole text, without subdivision into parts, and the transformation of the text to derive a summary.

The architecture of the developed system and the main algorithm are described. An example of a summary derived by the system and its quality evaluation are provided. The current version of the system has the following restrictions: it does not permit any formulas or special symbols.

Added: Oct 5, 2018
Article
Lepskiy A. E. Journal of machine learning and data analysis. 2014. Vol. 1. No. 8. P. 949-965.

This paper studies the stability of histogram comparison under different probabilistic methods.

Background: The comparison of histograms is necessary in many applied data processing problems. Comparison of the "more-less" type is considered in this paper. Histograms may, however, be distorted, and the nature of these distortions can differ. The problem is then to find the conditions on the distortions under which the comparison of two histograms does not change.

Methods: There are many approaches to comparing histograms. Three popular probabilistic methods are considered in this paper: comparison of mathematical expectations, comparison by the principle of stochastic dominance, and comparison by stochastic precedence. Interval distortions of histograms are considered.

Results: Necessary and sufficient conditions for preserving the comparison of distorted histograms are found with respect to different probabilistic comparison indices. A description of the set of admissible distortions that preserve the comparison of two histograms is obtained. Characteristics of the stability of histograms to distortion are introduced. These characteristics are calculated for histograms of the USE (Unified State Exam) scores of applicants admitted to Russian universities in 2012. It is shown that the stability of histogram comparison to distortion may not correspond to the values of the difference index of comparison (margin).

Conclusions: The conditions found for the invariance of histogram comparison can be used to estimate the reliability of the results of different rankings, data processing, etc. under different types of uncertainty: stochastic uncertainty, the uncertainty associated with the distortion of data when filling data gaps, etc.

Added: Oct 1, 2014
Article
Savchenko A. V. Journal of machine learning and data analysis. 2015. Vol. 1. No. 11. P. 1500-1516.
Added: Sep 10, 2015
Article
Beklaryan L. A., Makarov V. L. Journal of machine learning and data analysis. 2015. Vol. 10. P. 1385-1395.

The Henning model of population behavior and its modifications are considered. The modifications are made to overcome some disadvantages of the Henning model that are connected with the death of the whole population. This subject is important to study, since such phenomena may be observed both in unexplored wilderness and in human civilization. Another model is also presented in which, in contrast to the Henning model and its modifications, interaction is determined endogenously, i.e. interaction based on instinct-type reactions is replaced by the use of elements of ethics.

Added: Feb 21, 2015
Article
Beklaryan L. A., Makarov V. L., Belousov F. A. Journal of machine learning and data analysis. 2014. Vol. 10. P. 1385-1395.

Added: Mar 30, 2015