Modifications of the EM Algorithm for Probabilistic Topic Modeling
Probabilistic topic models discover a low-dimensional interpretable representation of text corpora
by estimating a multinomial distribution over topics for each document and a multinomial
distribution over terms for each topic. We consider a unified family of expectation-maximization (EM)-like
algorithms with smoothing, sampling, sparsing, and robustness heuristics that can be used in
any combination. The known models PLSA (probabilistic latent semantic analysis),
LDA (latent Dirichlet allocation), and SWB (special words with background), as well as new
ones, are special cases of this broad family of models. We propose a new, simple robust
algorithm that is suitable for sparse models and does not require estimating and storing a large matrix
of noise parameters. We experimentally find optimal combinations
of heuristics with sparsing strategies and discover that a sparse robust model without Dirichlet
smoothing performs very well, yielding more than 99% zeros in the multinomial distributions
without loss of perplexity.
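As a rough illustration of the EM iterations this abstract refers to, the sketch below fits plain PLSA, the simplest member of the family, in pure Python. The function names (`plsa_em`, `perplexity`, `normalize`) and the toy corpus are hypothetical. The optional `alpha` and `beta` arguments imitate LDA-style smoothing by adding constants to the expected counts; that this matches the paper's exact update is an assumption. The sampling, sparsing, and robustness heuristics are not reproduced here.

```python
import math
import random


def normalize(values):
    """Scale non-negative values to sum to 1 (uniform if they are all zero)."""
    total = sum(values)
    if total <= 0:
        return [1.0 / len(values)] * len(values)
    return [v / total for v in values]


def plsa_em(docs, num_topics, num_iters=50, alpha=0.0, beta=0.0, seed=0):
    """Fit p(w|t) and p(t|d) by EM.

    docs: list of {term: count} dicts.
    alpha, beta: additive smoothing of the counts (assumed form; 0 gives plain PLSA).
    Returns (vocab, phi, theta) with phi[t][i] = p(vocab[i] | t)
    and theta[d][t] = p(t | d).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    topics = range(num_topics)
    # Random positive initialization, normalized into distributions.
    phi = [normalize([rng.uniform(0.5, 1.5) for _ in vocab]) for _ in topics]
    theta = [normalize([rng.uniform(0.5, 1.5) for _ in topics]) for _ in docs]

    for _ in range(num_iters):
        n_wt = [[0.0] * len(vocab) for _ in topics]
        n_td = [[0.0] * num_topics for _ in docs]
        for d, doc in enumerate(docs):
            for w, n_dw in doc.items():
                i = index[w]
                # E-step: p(t | d, w) is proportional to p(w | t) * p(t | d).
                joint = [phi[t][i] * theta[d][t] for t in topics]
                z = sum(joint)
                if z == 0:
                    continue
                for t in topics:
                    n_wt[t][i] += n_dw * joint[t] / z
                    n_td[d][t] += n_dw * joint[t] / z
        # M-step: renormalize the expected counts (plus optional smoothing).
        phi = [normalize([n + beta for n in n_wt[t]]) for t in topics]
        theta = [normalize([n + alpha for n in n_td[d]]) for d in range(len(docs))]
    return vocab, phi, theta


def perplexity(docs, vocab, phi, theta):
    """exp of the negative average log-likelihood per term occurrence."""
    index = {w: i for i, w in enumerate(vocab)}
    log_lik, total = 0.0, 0
    for d, doc in enumerate(docs):
        for w, n_dw in doc.items():
            p = sum(phi[t][index[w]] * theta[d][t] for t in range(len(phi)))
            log_lik += n_dw * math.log(p)
            total += n_dw
    return math.exp(-log_lik / total)


# Toy corpus (hypothetical): two documents about pets, two about finance.
docs = [
    {"cat": 3, "dog": 2},
    {"dog": 3, "cat": 1},
    {"stock": 3, "bond": 2},
    {"bond": 2, "stock": 2},
]
vocab, phi, theta = plsa_em(docs, num_topics=2)
print(perplexity(docs, vocab, phi, theta))
```

With `alpha = beta = 0` each iteration cannot decrease the likelihood, so perplexity on the training corpus is non-increasing. Once smoothing or sparsing is added that guarantee no longer holds in general, which is one reason to compare combinations of heuristics experimentally.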
Probabilistic topic modeling of text collections is a powerful tool for statistical text analysis. In this tutorial we introduce a novel non-Bayesian approach called Additive Regularization of Topic Models (ARTM). ARTM is free of redundant probabilistic assumptions and provides simple inference for many combined and multi-objective topic models.
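The abstract gives no formulas. For orientation, the regularized EM iteration is usually stated in the ARTM literature roughly as below. The notation is assumed, not taken from this text: n_{dw} is the count of term w in document d, \phi_{wt} = p(w|t), \theta_{td} = p(t|d), and R is a weighted sum of regularizers.

```latex
% Objective: log-likelihood plus additive regularizers R_i with weights \tau_i
\sum_{d}\sum_{w} n_{dw} \ln \sum_{t} \phi_{wt}\theta_{td}
  + R(\Phi,\Theta) \;\to\; \max_{\Phi,\Theta},
\qquad R(\Phi,\Theta) = \sum_{i} \tau_i R_i(\Phi,\Theta)

% E-step
p_{tdw} = \operatorname*{norm}_{t}\bigl(\phi_{wt}\theta_{td}\bigr)

% M-step
n_{wt} = \sum_{d} n_{dw}\, p_{tdw}, \qquad
\phi_{wt} = \operatorname*{norm}_{w}\Bigl(n_{wt} + \phi_{wt}\frac{\partial R}{\partial \phi_{wt}}\Bigr)

n_{td} = \sum_{w} n_{dw}\, p_{tdw}, \qquad
\theta_{td} = \operatorname*{norm}_{t}\Bigl(n_{td} + \theta_{td}\frac{\partial R}{\partial \theta_{td}}\Bigr)

% Non-negative normalization
\operatorname*{norm}_{i}(x_i) = \frac{\max(x_i, 0)}{\sum_{j}\max(x_j, 0)}
```

Setting R = 0 recovers the plain PLSA updates, which is why several regularizers can be combined simply by summing them.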
The aim of this article is to analyze the discursive background for teacher characters in the Soviet school story of the postwar period. The 1.8-million-word corpus for the study was compiled from novels about school and schooling by 37 authors, written from the 1940s to the 1980s. The content of the episodes in which the keywords (headmaster, deputy headmaster, teacher, female teacher) are mentioned was analyzed automatically with probabilistic topic modeling (LDA). Topics significantly more or less common in these episodes than in the whole corpus were used to characterize the discursive context of the keywords. Judging by the thematic profiles, the term ‘female teacher’ is opposed to all the rest. Meaningful contrasts distinguishing the thematic profiles of the terms are: the discourse of upbringing and everyday schooling, Komsomol and pioneers, emotions and gender.
An important text mining problem is to find, in a large collection of texts, documents related to specific topics and then discern further structure among the found texts. This problem is especially important for social sciences, where the purpose is to find the most representative documents for subsequent qualitative interpretation. To solve this problem, we propose an interval semi-supervised LDA approach, in which certain predefined sets of keywords (that define the topics researchers are interested in) are restricted to specific intervals of topic assignments.
We propose a generalized probabilistic topic model of text corpora that can incorporate heuristics of Bayesian regularization, sampling, frequent parameter updates, and robustness in any combination. The well-known models PLSA, LDA, CVB0, SWB, and many others can be considered special cases of the proposed broad family of models. We propose the robust PLSA model and show that it is sparser and performs better than regularized models such as LDA.
This paper considers an approach to solving the problem of binary classification of objects. This approach is based on representing one of the classes by a sequence of Gaussian mixtures with further introduction of threshold decision rules. A method of constructing hierarchical sequences of Gaussian mixtures using the partial EM algorithm is proposed. We compare classifiers that use single Gaussian mixtures, cascades based on sequences of independent mixtures, cascades based on hierarchical sequences of mixtures, and classifiers that use trees of Gaussian densities for decision making. The theoretical estimates of computational costs for these classifiers are provided. The classifiers are tested on simulated data. The results are presented as the relations between the computational cost of classification and the obtained values of error criteria.
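A threshold cascade of this kind can be sketched in a few lines. This is a one-dimensional toy with hand-set parameters. The stage layout, the function names, and the cost measure (number of Gaussian densities evaluated) are illustrative assumptions, and the paper's hierarchical construction with the partial EM algorithm is not reproduced.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-D normal distribution."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


def mixture_pdf(x, components):
    """components: list of (weight, mean, std) tuples with weights summing to 1."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in components)


def cascade_classify(x, stages):
    """Pass x through the stages in order, from cheap to expensive.

    stages: list of (components, threshold) pairs.
    An object is rejected as soon as one stage's mixture density falls below
    that stage's threshold; it is accepted only if it survives every stage.
    Returns (accepted, number_of_gaussian_densities_evaluated).
    """
    cost = 0
    for components, threshold in stages:
        cost += len(components)
        if mixture_pdf(x, components) < threshold:
            return False, cost
    return True, cost


# Hypothetical two-stage cascade: a coarse single Gaussian, then a finer mixture.
stages = [
    ([(1.0, 0.0, 2.0)], 0.01),
    ([(0.5, -1.0, 0.5), (0.5, 1.0, 0.5)], 0.05),
]
for x in (10.0, 3.0, 1.0):
    print(x, cascade_classify(x, stages))
```

Objects far from the modeled class are rejected by the cheap first stage, so the average cost per object can be well below the cost of evaluating the final mixture for every object. That is the trade-off between computational cost and error criteria that the abstract reports.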
An important text mining problem is to find, in a large collection of texts, documents related to specific topics and then discern further structure among the found texts. This problem is especially important for social sciences, where the purpose is to find the most representative documents for subsequent qualitative interpretation. To solve this problem, we propose an interval semi-supervised LDA approach, in which certain predefined sets of keywords (that define the topics researchers are interested in) are restricted to specific intervals of topic assignments. We present a case study on a Russian LiveJournal dataset aimed at ethnicity discourse analysis.
We consider certain spaces of functions on the circle that arise naturally in harmonic analysis, and superposition operators on these spaces. We study the following question: which functions have the property that each of their superpositions with a homeomorphism of the circle belongs to a given space? We also study the multidimensional case.
We consider the spaces of functions on the m-dimensional torus whose Fourier transform is p-summable. We obtain estimates for the norms of exponential functions deformed by a C^1-smooth phase. These results extend to the multidimensional case the one-dimensional results obtained earlier by the author in “Quantitative estimates in the Beurling–Helson theorem”, Sbornik: Mathematics, 201:12 (2010), 1811–1836.
We consider the spaces of functions on the circle whose Fourier transform is p-summable. We obtain estimates for the norms of exponential functions deformed by a C^1-smooth phase.
This proceedings publication is a compilation of selected contributions from the “Third International Conference on the Dynamics of Information Systems”, which took place at the University of Florida, Gainesville, February 16–18, 2011. The purpose of this conference was to bring together scientists and engineers from industry, government, and academia to exchange new discoveries and results in a broad range of topics relevant to the theory and practice of the dynamics of information systems. Dynamics of Information Systems: Mathematical Foundation presents state-of-the-art research and is intended for graduate students and researchers interested in some of the most recent discoveries in information theory and dynamical systems. Scientists in other disciplines may also benefit from the applications of new developments to their own area of study.