Modifications of the EM-Algorithm for Probabilistic Topic Modeling
Probabilistic topic models discover a low-dimensional interpretable representation of a text corpus by estimating a multinomial distribution over topics for each document and a multinomial distribution over terms for each topic. A unified family of expectation-maximization (EM) like algorithms is considered, with smoothing, sampling, sparsing, and robustness heuristics that can be used in any combination. The known models PLSA (probabilistic latent semantic analysis), LDA (latent Dirichlet allocation), and SWB (special words with background), as well as new models, can be viewed as special cases of this broad family. A new simple robust algorithm suitable for sparse models is proposed that does not require estimating and storing a large matrix of noise parameters. Experimentally optimal combinations of heuristics and sparsing strategies are found, and it is shown that a sparse robust model without Dirichlet smoothing performs very well, yielding more than 99% zeros in the multinomial distributions with no loss of perplexity.
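To make the setup concrete, the following is a minimal sketch of a PLSA-style EM iteration with one naive sparsing heuristic (zeroing tiny probabilities and renormalizing). This is an illustration under simplifying assumptions, not the paper's exact algorithm; the function name `plsa_em`, the threshold `sparse_eps`, and the dense responsibility tensor are choices made here for clarity.

```python
import numpy as np

def plsa_em(ndw, num_topics, iters=50, sparse_eps=1e-3, seed=None):
    """Toy PLSA EM with a naive sparsing step (illustrative sketch only).

    ndw : (D, W) matrix of term counts n_dw per document d and word w.
    Returns phi (T, W) ~ p(w|t) and theta (D, T) ~ p(t|d).
    """
    rng = np.random.default_rng(seed)
    D, W = ndw.shape
    phi = rng.dirichlet(np.ones(W), size=num_topics)    # p(w|t)
    theta = rng.dirichlet(np.ones(num_topics), size=D)  # p(t|d)
    for _ in range(iters):
        # E-step: responsibilities p(t|d,w) ∝ phi[t,w] * theta[d,t]
        pdwt = theta[:, None, :] * phi.T[None, :, :]    # (D, W, T)
        pdwt /= pdwt.sum(axis=2, keepdims=True) + 1e-12
        # M-step: accumulate expected counts and normalize
        nwt = np.einsum('dw,dwt->tw', ndw, pdwt)        # n_wt
        ndt = np.einsum('dw,dwt->dt', ndw, pdwt)        # n_dt
        phi = nwt / (nwt.sum(axis=1, keepdims=True) + 1e-12)
        theta = ndt / (ndt.sum(axis=1, keepdims=True) + 1e-12)
        # Sparsing heuristic: drop small entries, renormalize rows.
        phi[phi < sparse_eps] = 0.0
        phi /= phi.sum(axis=1, keepdims=True) + 1e-12
        theta[theta < sparse_eps] = 0.0
        theta /= theta.sum(axis=1, keepdims=True) + 1e-12
    return phi, theta
```

In a real implementation the dense `(D, W, T)` tensor would be avoided by iterating over the nonzero entries of the sparse count matrix, and the sparsing/smoothing steps would follow the specific heuristics studied in the paper.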