Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization
Probabilistic topic modeling of text collections is a powerful tool for statistical text analysis. In this tutorial we introduce a novel non-Bayesian approach called Additive Regularization of Topic Models (ARTM). ARTM is free of redundant probabilistic assumptions and provides simple inference for many combined and multi-objective topic models.
Probabilistic topic models discover a low-dimensional interpretable representation of text corpora by estimating a multinomial distribution over topics for each document and a multinomial distribution over terms for each topic. We consider a unified family of expectation-maximization (EM) like algorithms with smoothing, sampling, sparsing, and robustness heuristics that can be used in any combination. The known models PLSA (probabilistic latent semantic analysis), LDA (latent Dirichlet allocation), and SWB (special words with background), as well as new ones, can be considered special cases of this broad family. We propose a new, simple robust algorithm suitable for sparse models that does not require estimating and storing a big matrix of noise parameters. We find experimentally optimal combinations of heuristics with sparsing strategies and discover that a sparse robust model without Dirichlet smoothing performs very well and gives more than 99% zeros in the multinomial distributions without loss of perplexity.
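The EM-like family described above can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes a dense document-term count matrix and models smoothing or sparsing as a constant offset added to the expected counts in the M-step, followed by truncation at zero. The names `plsa_em`, `beta`, and `alpha` are hypothetical; with both offsets set to zero the iterations reduce to plain PLSA.

```python
import numpy as np


def _normalize_rows(x):
    """Scale each row to sum to 1; rows that are entirely zero stay zero."""
    s = x.sum(axis=1, keepdims=True)
    return np.divide(x, s, out=np.zeros_like(x), where=s > 0)


def plsa_em(n_dw, num_topics, num_iters=50, beta=0.0, alpha=0.0, seed=0):
    """EM-like iterations for the factorization p(w|d) = sum_t theta[d, t] * phi[t, w].

    n_dw  : (D, W) array of term counts per document.
    beta  : hypothetical offset for term-in-topic counts
            (positive = smoothing, negative = sparsing).
    alpha : hypothetical offset for topic-in-document counts.
    Returns phi of shape (T, W) and theta of shape (D, T).
    """
    n_dw = np.asarray(n_dw, dtype=float)
    rng = np.random.default_rng(seed)
    num_docs, num_terms = n_dw.shape
    phi = _normalize_rows(rng.random((num_topics, num_terms)))
    theta = _normalize_rows(rng.random((num_docs, num_topics)))

    for _ in range(num_iters):
        # E-step, folded into matrix form: p(t|d,w) = theta[d,t] * phi[t,w] / p(w|d).
        p_dw = theta @ phi
        ratio = np.divide(n_dw, p_dw, out=np.zeros_like(p_dw), where=p_dw > 0)
        n_tw = phi * (theta.T @ ratio)   # expected term-in-topic counts
        n_dt = theta * (ratio @ phi.T)   # expected topic-in-document counts

        # M-step: add the offset, truncate negatives to zero, renormalize.
        phi = _normalize_rows(np.maximum(n_tw + beta, 0.0))
        theta = _normalize_rows(np.maximum(n_dt + alpha, 0.0))
    return phi, theta
```

Calling `plsa_em(counts, num_topics=20, beta=-0.1)` on a count matrix `counts` would zero out weakly supported term probabilities. The offset value here is illustrative only, and a practical sparsing strategy would typically switch it on gradually after the first iterations.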
An important text mining problem is to find, in a large collection of texts, documents related to specific topics and then discern further structure among the found texts. This problem is especially important for social sciences, where the purpose is to find the most representative documents for subsequent qualitative interpretation. To solve this problem, we propose an interval semi-supervised LDA approach, in which certain predefined sets of keywords (that define the topics researchers are interested in) are restricted to specific intervals of topic assignments. We present a case study on a Russian LiveJournal dataset aimed at ethnicity discourse analysis.
In this paper we consider the behavior of Kalman filter state estimates in the case of heavy-tailed distributions. We use simulated linear state-space models with Gaussian measurement noise. The Gaussian noise in the state equation is replaced by components with an alpha-stable distribution with different values of the parameters alpha and beta. We consider the case when all parameters are known, and we compare two methods of parameter estimation: the maximum likelihood estimator (MLE) and the expectation-maximization (EM) algorithm. We show that for large deviations from the Gaussian distribution the total error of state estimation rises dramatically. We conjecture that this can be explained by underestimation of the covariance matrix of the state equation noise, which can be taken into account through EM parameter estimation but is ignored in the case of ML estimation.
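The kind of simulation described above can be sketched as follows. This is a hedged illustration under simplifying assumptions and not the setup used in the paper: the model is scalar, the state noise is symmetric alpha-stable (beta = 0) drawn with the Chambers-Mallows-Stuck method, and the function names, the transition coefficient `a`, and the noise variances `q` and `r` are all hypothetical.

```python
import numpy as np


def symmetric_stable(alpha, size, rng):
    """Draw symmetric alpha-stable variates (beta = 0, unit scale), 0 < alpha <= 2.

    Chambers-Mallows-Stuck method; alpha = 2 gives a Gaussian with variance 2.
    """
    u = rng.uniform(-np.pi / 2, np.pi / 2, size)
    e = rng.exponential(1.0, size)
    return (np.sin(alpha * u) / np.cos(u) ** (1.0 / alpha)) * (
        np.cos((1.0 - alpha) * u) / e
    ) ** ((1.0 - alpha) / alpha)


def simulate(alpha, n, a=0.9, r=1.0, seed=0):
    """Simulate x[k] = a * x[k-1] + w[k], y[k] = x[k] + v[k].

    w is symmetric alpha-stable state noise, v is Gaussian measurement
    noise with variance r.
    """
    rng = np.random.default_rng(seed)
    w = symmetric_stable(alpha, n, rng)
    v = rng.normal(0.0, np.sqrt(r), n)
    x = np.empty(n)
    prev = 0.0
    for k in range(n):
        prev = a * prev + w[k]
        x[k] = prev
    return x, x + v


def kalman_filter(y, a, q, r, x0=0.0, p0=1.0):
    """Scalar Kalman filter that assumes Gaussian state noise with variance q."""
    x, p = x0, p0
    estimates = np.empty(len(y))
    for k, yk in enumerate(y):
        # Predict.
        x_pred = a * x
        p_pred = a * a * p + q
        # Update.
        gain = p_pred / (p_pred + r)
        x = x_pred + gain * (yk - x_pred)
        p = (1.0 - gain) * p_pred
        estimates[k] = x
    return estimates
```

One way to use it is to call `simulate` for several values of `alpha`, run `kalman_filter` with a fixed assumed `q`, and compare the mean squared error between the returned estimates and the true states as `alpha` moves away from 2.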
Topic modeling is a powerful tool for analyzing large collections of user-generated web content, but it still suffers from problems with topic stability, which are especially important for social sciences. We evaluate stability for different topic models and propose a new model, granulated LDA (gLDA), that samples short sequences of neighboring words at once. We show that gLDA exhibits very stable results. ©2016 Copyright held by the owner/author(s).
An important text mining problem is to find, in a large collection of texts, documents related to specific topics and then discern further structure among the found texts. This problem is especially important for social sciences, where the purpose is to find the most representative documents for subsequent qualitative interpretation. To solve this problem, we propose an interval semi-supervised LDA approach, in which certain predefined sets of keywords (that define the topics researchers are interested in) are restricted to specific intervals of topic assignments.
Observed polar motion consists of uniform circular motions at both positive (prograde) and negative (retrograde) frequencies. Generalized Euler–Liouville equations of Bizouard, taking into account Earth's triaxiality and asymmetry of the ocean tide, show that the corresponding retrograde and prograde circular excitations are coupled at any frequency. In this work, we reconstructed the polar motion excitation in the Chandler band (prograde and retrograde). Then we compared it with geophysical excitation, filtered out in the same way from the series of the Oceanic Angular Momentum (OAM) and Atmospheric Angular Momentum (AAM) for the period 1960–2000. The agreement was found to be better in the prograde band than in the retrograde one.