Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization

Konstantin Vorontsov; Anna Potapenko

P. 29–46.

Probabilistic topic modeling of text collections is a powerful tool for statistical text analysis. In this tutorial we introduce a novel non-Bayesian approach, called Additive Regularization of Topic Models (ARTM). ARTM is free of redundant probabilistic assumptions and provides simple inference for many combined and multi-objective topic models.

Language: English

Keywords: EM algorithm; latent Dirichlet allocation; topic models; additive regularization; probabilistic topic modeling; regularization of ill-posed inverse problems; stochastic matrix factorization; probabilistic latent semantic analysis

In book

Communications in Computer and Information Science

Vol. 436: Analysis of Images, Social Networks and Texts. Third International Conference, AIST 2014, Yekaterinburg, Russia, April 10–12, 2014, Revised Selected Papers. Cham: Springer, 2014.

Renormalization approach to the task of determining the number of topics in topic modeling

Koltsov S., Ignatenko V., in: Intelligent Computing: SAI 2020: Volume 1. Vol. 1228. Switzerland: Springer, 2020. P. 234–247.

Topic modeling is a widely used approach for clustering text documents; however, it has a set of parameters that the user must determine, such as the number of topics. In this paper, we propose a novel approach for fast approximation of the optimal topic number that corresponds well to human judgment. Our method ...

Added: November 11, 2019

Additive Regularization for Hierarchical Multimodal Topic Modeling

N. A. Chirkova, K. V. Vorontsov, Journal of machine learning and data analysis 2016 Vol. 2 No. 2 P. 187–200

Probabilistic topic models uncover the latent semantics of text collections and represent each document by a multinomial distribution over topics. Hierarchical models divide topics into subtopics recursively, thus simplifying information retrieval, browsing, and understanding of large multidisciplinary collections. Most existing approaches to hierarchy learning rely on Bayesian inference. This complicates the incorporation ...

Added: October 19, 2017

Stable topic modeling for web science: Granulated LDA

Koltsov S., Nikolenko S. I., Koltsova O. et al., in: WebSci 2016 - Proceedings of the 2016 ACM Web Science Conference. Elsevier, 2016. P. 342–343.

Topic modeling is a powerful tool for analyzing large collections of user-generated web content, but it still suffers from problems with topic stability, which are especially important for social sciences. We evaluate stability for different topic models and propose a new model, granulated LDA, that samples short sequences of neighboring words at once. We show that gLDA ...

Added: October 24, 2016

Convergence of an alternating maximization procedure

Andresen A., Spokoiny V., Journal of Machine Learning Research 2016 No. 17(63) P. 1–53

We derive two convergence results for a sequential alternating maximization procedure to approximate the maximizer of random functionals such as the realized log likelihood in MLE estimation. We manage to show that the sequence attains the same deviation properties as shown for the profile M-estimator by Andresen and Spokoiny (2013), that means a finite sample ...

Added: September 8, 2016

Topic models: adding bigrams and accounting for the similarity between unigrams and bigrams

Nokel M., Loukachevitch N. V., Вычислительные методы и программирование 2015 Vol. 16 No. 2 P. 215–234

The results of an experimental study of adding bigrams and accounting for the similarity between them and unigrams are discussed. A novel PLSA-SIM algorithm based on a modification of the original PLSA (Probabilistic Latent Semantic Analysis) algorithm is proposed. The proposed algorithm incorporates bigrams and takes into account the similarity between them and unigram components. ...

Added: March 15, 2016

A method for taking bigram structure into account in topic models

Nokel M., Вестник Воронежского государственного университета. Серия: Системный анализ и информационные технологии 2014 No. 4 P. 89–97

The paper presents the results of an experimental study of integrating word similarity and bigram collocations into topic models. First, we analyze a variety of word association measures in order to integrate top-ranked bigrams into topic models. Then we propose a modification of the original PLSA algorithm, which takes into account similar unigrams and ...

Added: March 15, 2016

Reconstruction of prograde and retrograde Chandler excitation

Zotov L., Bizouard C., Journal of Inverse and Ill-posed problems 2015 Vol. 24 No. 1 P. 99–105

Observed polar motion consists of uniform circular motions at both positive (prograde) and negative (retrograde) frequencies. Generalized Euler–Liouville equations of Bizouard, taking into account Earth's triaxiality and asymmetry of the ocean tide, show that the corresponding retrograde and prograde circular excitations are coupled at any frequency. In this work, we reconstructed the polar motion excitation ...

Added: September 30, 2015

Shape Perception

Sawada T., Li Y., Pizlo Z., , in: The Oxford Handbook of Computational and Mathematical Psychology.: Oxford University Press, 2015. P. 255–276.

This chapter provides a review of topics and concepts that are necessary to study and understand 3D shape perception. This includes group theory and their invariants; model-based invariants; Euclidean, affine, and projective geometry; symmetry; inverse problems; simplicity principle; Fechnerian psychophysics; regularization theory; Bayesian inference; shape constancy and shape veridicality; shape recovery; perspective and orthographic projections; ...

Added: March 10, 2015

Additive Regularization of Topic Models

Vorontsov K. V., Potapenko A., Machine Learning 2015 Vol. 101 No. 1 P. 303–323

Probabilistic topic modeling of text collections has been recently developed mainly within the framework of graphical models and Bayesian inference. In this paper we introduce an alternative semi-probabilistic approach, which we call additive regularization of topic models (ARTM). Instead of building a purely probabilistic generative model of text we regularize an ill-posed problem of stochastic matrix factorization ...

Added: February 19, 2015

Modifications of the EM algorithm for probabilistic topic modeling

Vorontsov K. V., Potapenko A., Машинное обучение и анализ данных 2013 Vol. 1 No. 6 P. 657–686

Probabilistic topic models discover a low-dimensional interpretable representation of text corpora by estimating a multinomial distribution over topics for each document and a multinomial distribution over terms for each topic. A unified family of expectation-maximization (EM) like algorithms with smoothing, sampling, sparsing, and robustness heuristics that can be used in any combination is considered. The ...

Added: February 19, 2015

Regularization, robustness, and sparsity of probabilistic topic models

Vorontsov K. V., Potapenko A., Компьютерные исследования и моделирование 2012 Vol. 4 No. 4 P. 693–706

We propose a generalized probabilistic topic model of text corpora which can incorporate heuristics of Bayesian regularization, sampling, frequent parameter updates, and robustness in any combination. The well-known models PLSA, LDA, CVB0, SWB, and many others can be considered as special cases of the proposed broad family of models. We propose the robust PLSA model ...

Added: February 19, 2015

The Soviet teacher against the background of the school story: a corpus perspective

Маслинский К. А., Детские чтения 2014 Vol. 6 No. 2 P. 112–126

The aim of this article is to analyze the discursive background for the characters of teachers in the Soviet school story of the postwar period. The 1.8-million-word corpus for the study was compiled from novels about school and schooling by 37 authors, written from the 1940s to the 1980s. The contents of the episodes ...

Added: January 17, 2015