Regularization, Robustness and Sparsity of Probabilistic Topic Models

K. V. Vorontsov; A. A. Potapenko

Computer Research and Modeling. 2012. Vol. 4. No. 4. P. 693-706.

Vorontsov K. V., Potapenko A.

We propose a generalized probabilistic topic model of text corpora that can incorporate Bayesian regularization, sampling, frequent parameter updates, and robustness heuristics in any combination. The well-known models PLSA, LDA, CVB0, SWB, and many others can be considered special cases of the proposed broad family of models. We propose the robust PLSA model and show that it is sparser and performs better than regularized models such as LDA.

Priority areas: IT and mathematics

Language: Russian

Keywords: robustness; EM algorithm; LDA; latent Dirichlet allocation; PLSA; probabilistic latent semantic analysis; probabilistic topic modeling

Modifications of the EM Algorithm for Probabilistic Topic Modeling

Vorontsov K. V., Potapenko A., Journal of Machine Learning and Data Analysis 2013 Vol. 1 No. 6 P. 657-686

Probabilistic topic models discover a low-dimensional interpretable representation of text corpora by estimating a multinomial distribution over topics for each document and a multinomial distribution over terms for each topic. A unified family of expectation-maximization (EM) like algorithms with smoothing, sampling, sparsing, and robustness heuristics that can be used in any combination is considered. The ...

Added: February 19, 2015

Modifications of the EM Algorithm for Probabilistic Topic Modeling

Vorontsov K. V., Potapenko A. A., Journal of Machine Learning and Data Analysis 2013 Vol. 1 No. 6 P. 657-686

Added: May 6, 2014

Additive Regularization of Topic Models

Vorontsov K. V., Potapenko A. A., Doklady Akademii Nauk 2014 Vol. 456 No. 3 P. 268-271

Probabilistic topic modeling of text document collections is currently being developed mainly within the Bayesian approach and graphical models. This paper proposes an alternative approach that is free of redundant probabilistic assumptions. Additive regularization of topic models (ARTM) is based on maximizing a weighted sum of the log-likelihood and additional criteria, called regularizers. This simplifies combining topic models and building arbitrarily complex multi-objective models. ...
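
As a hedged illustration of the criterion described in this abstract, the ARTM objective can be sketched as follows. The notation is an assumption and is not taken from this page: n_{dw} is the number of occurrences of term w in document d, \varphi_{wt} = p(w|t) and \theta_{td} = p(t|d) are the model parameters, R_i are the regularizers, and \tau_i \ge 0 are their weights.

```latex
\[
\sum_{d \in D} \sum_{w \in d} n_{dw} \ln \sum_{t \in T} \varphi_{wt}\,\theta_{td}
\;+\; \sum_{i=1}^{k} \tau_i R_i(\Phi, \Theta)
\;\to\; \max_{\Phi,\,\Theta},
\]
\[
\varphi_{wt} \ge 0,\quad \sum_{w \in W} \varphi_{wt} = 1;
\qquad
\theta_{td} \ge 0,\quad \sum_{t \in T} \theta_{td} = 1 .
\]
```

With all \tau_i = 0 the criterion reduces to the plain log-likelihood maximized by PLSA.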

Added: December 5, 2014

A Noise-Robust Method for Training a Variational Autoencoder

Figurnov M., Struminsky K., Vetrov D., Intelligent Systems. Theory and Applications 2017 Vol. 21 No. 2 P. 90-109

Variational autoencoder (VAE) is a probabilistic unsupervised method that uses deep learning. We propose a robust approach to the training of VAE using a modified likelihood function. We propose and analyze two variational lower bound objectives. The effectiveness of the method is experimentally shown by artificially introducing noise objects. ...

Added: October 18, 2017

Topic modelling for qualitative studies

Sergey Nikolenko, Sergei Koltcov, Olessia Koltsova, Journal of Information Science 2017 Vol. 43 No. 1 P. 88-102

Qualitative studies, such as sociological research, opinion analysis and media studies, can benefit greatly from automated topic mining provided by topic models such as latent Dirichlet allocation (LDA). However, examples of qualitative studies that employ topic modelling as a tool are currently few and far between. In this work, we identify two important problems along ...

Added: October 7, 2016

Stability and Control Processes: Proceedings of the III International Conference (Saint Petersburg, October 5-9, 2015)

Saint Petersburg: Fedorova G. V. Publishing House, 2015

Proceedings of the III International Conference in memory of V.I. Zubov "Stability and Control Processes (SCP 2015)". ...

Added: October 14, 2015

Residual empirical processes and qualitatively robust GM-tests in autoregression

Boldin M. V., Esaulov D., Moscow University Mathematics Bulletin 2014 Vol. 69 No. 1 P. 29-32

The local qualitative robustness of GM-tests against outliers in the autoregression model is studied in the paper. A local scheme of data contamination by independent outliers with the intensity O(n^{-1/2}) is considered. The qualitative robustness in terms of power equicontinuity is obtained. The GM-tests asymptotically optimal in the maximin sense are constructed. ...

Added: October 17, 2016

Robustness of GM-tests in autoregression against outliers

Esaulov D., Moscow University Mathematics Bulletin 2012 Vol. 67 No. 2 P. 79-81

The paper deals with properties of GM-estimators and GM-tests for linear hypotheses in AR(p)-processes when observations contain outliers. In particular, we obtain the marginal distribution of test statistics, which allows us to prove the robustness of these GM-tests. The scheme of data contamination by additive single outliers with the intensity O(n^{-1/2}), where n is the data level, is ...

Added: October 19, 2016

Robust PLSA Performs Better Than LDA

Konstantin Vorontsov, Potapenko A., Lecture Notes in Computer Science 2013 Vol. 7814 P. 784-787

In this paper we introduce a generalized learning algorithm for probabilistic topic models (PTM). Many known and new algorithms for PLSA, LDA, and SWB models can be obtained as its special cases by choosing a subset of the following “options”: regularization, sampling, update frequency, sparsing and robustness. We show that a robust topic model, which ...

Added: November 13, 2013

Multilevel classifiers based on a tree-structured set of Gaussian densities

N.A. Novikov, Pattern Recognition and Image Analysis 2014 Vol. 24 No. 3 P. 443-451

This paper considers an approach to solving the problem of binary classification of objects. This approach is based on representing one of the classes by a sequence of Gaussian mixtures with further introduction of threshold decision rules. A method of constructing hierarchical sequences of Gaussian mixtures using the partial EM algorithm is proposed. We compare ...

Added: January 16, 2015

The Soviet Teacher Against the Background of the School Story: A Corpus Perspective

Maslinsky K. A., Detskie Chtenia 2014 Vol. 6 No. 2 P. 112-126

The aim of this article is to analyze the discursive background for the characters of teachers in the Soviet school story of the postwar period. The 1.8-million-word corpus for the study was compiled from novels about school and schooling by 37 authors, written from the 1940s to the 1980s. The contents of the episodes ...

Added: January 17, 2015

Evaluating the Effectiveness of Early-Stage Investment Projects Based on Robustness Analysis

Chaprak N., in: Proceedings of the International Youth Scientific Forum "LOMONOSOV-2015". Moscow: MAKS Press, 2015. P. 102-105.

The article presents a method for evaluating early-stage projects based on robustness analysis, using the following criteria: the probability that a given outcome is realized, the probabilistic rank expectation of project outcomes, and the stress-test space. ...

Added: April 25, 2015

Renormalization Analysis of Topic Models

Koltcov Sergei, Ignatenko V., Entropy 2020 Vol. 22 No. 5 P. 1-23

In practice, to build a machine learning model of big data, one needs to tune model parameters. The process of parameter tuning involves extremely time-consuming and computationally expensive grid search. However, the theory of statistical physics provides techniques allowing us to optimize this process. The paper shows that a function of the output of topic ...

Added: May 18, 2020

Stable Topic Modeling with Local Density Regularization

Sergei Koltcov, Nikolenko S. I., Olessia Koltsova et al., in: Internet Science, Proc. of 3d conf INSCI 2016, Lecture Notes in Computer Science series. Vol. 9934. Switzerland: Springer, 2016. P. 176-188.

Topic modeling has emerged over the last decade as a powerful tool for analyzing large text corpora, including Web-based user-generated texts. Topic stability, however, remains a concern: topic models have a very complex optimization landscape with many local maxima, and even different runs of the same model yield very different topics. Aiming to add stability ...

Added: October 7, 2016

Renormalization approach to the task of determining the number of topics in topic modeling

Koltsov S., Ignatenko V., in: Intelligent Computing: SAI 2020: Volume 1. Vol. 1228. Switzerland: Springer, 2020. P. 234-247.

Topic modeling is a widely used approach for clustering text documents; however, it has a set of parameters that must be set by the user, for example, the number of topics. In this paper, we propose a novel approach for fast approximation of the optimal topic number that corresponds well to human judgment. Our method ...

Added: November 11, 2019

Robustness Measure for Portfolio Management Strategy

Sharipova A., Arkov V. Yu., Bulletin of the South Ural State University. Series: Computer Technologies, Automatic Control, Radio Electronics 2017 Vol. 17 No. 3 P. 88-98

A practical approach to estimating the robustness of an investment strategy is presented. The degree of smoothness of the objective function is proposed as a quantitative measure of robustness. After optimization, an additional criterion is needed to select the strategies with better robustness. The utilization of the ...

Added: October 17, 2020

Analyzing the Influence of Hyper-parameters and Regularizers of Topic modeling in Terms of Renyi Entropy

Ignatenko V., Koltsov S., Staab S. et al., Physica A: Statistical Mechanics and its Applications 2019

Topic modeling is a popular approach for clustering text documents. A variety of different types of regularization is implemented in topic modeling. In this paper, we propose a novel approach for analyzing the influence of different regularization types on results of topic modeling. Based on Renyi entropy, this approach is inspired by the concepts from ...

Added: October 31, 2019

Additive Regularization for Hierarchical Multimodal Topic Modeling

K. V. Vorontsov, Journal of machine learning and data analysis 2016 Vol. 2 No. 2 P. 187-200

Probabilistic topic models uncover the latent semantics of text collections and represent each document by a multinomial distribution over topics. Hierarchical models divide topics into subtopics recursively, thus simplifying information retrieval, browsing and understanding of large multidisciplinary collections. Most existing approaches to hierarchy learning rely on Bayesian inference. This makes difficult the incorporation ...

Added: October 19, 2017

Regularization of Probabilistic Topic Models to Improve Interpretability and Determine the Number of Topics

Vorontsov K. V., Potapenko A., in: Computational Linguistics and Intellectual Technologies: Proceedings of the Annual International Conference "Dialogue" (Bekasovo, June 4-8, 2014). Issue 13(20). Moscow: RSUH Publishing, 2014. P. 676-687.

Probabilistic topic modeling is a modern tool for statistical text analysis designed to reveal the topics of document collections. The problem of building a topic model has infinitely many solutions, which leads to instability and poor interpretability of the topics. To address these problems, a new multicriteria approach is applied: additive regularization of topic models (ARTM). Four regularizers are introduced: for identifying common-vocabulary words in ...

Added: December 5, 2014

Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization

Konstantin Vorontsov, Anna Potapenko, in: Communications in Computer and Information Science. Vol. 436: Analysis of Images, Social Networks and Texts. Third International Conference, AIST 2014 Yekaterinburg, Russia, April 10–12, 2014 Revised Selected Papers. Cham: Springer, 2014. P. 29-46.

Probabilistic topic modeling of text collections is a powerful tool for statistical text analysis. In this tutorial we introduce a novel non-Bayesian approach, called Additive Regularization of Topic Models. ARTM is free of redundant probabilistic assumptions and provides a simple inference for many combined and multi-objective topic models. ...

Added: December 5, 2014

Analyzing the Influence of Hyper-parameters and Regularizers of Topic Modeling in Terms of Renyi entropy

Koltsov S., Ignatenko V., Boukhers Z. et al., Entropy 2020 Vol. 22 No. 4 P. 1-13

Topic modeling is a popular technique for clustering large collections of text documents. A variety of different types of regularization is implemented in topic modeling. In this paper, we propose a novel approach for analyzing the influence of different regularization types on results of topic modeling. Based on Renyi entropy, this approach is inspired by ...

Added: April 1, 2020

Fractal approach for determining the optimal number of topics in the field of topic modeling

Ignatenko V., Sergei Koltcov, Staab S. et al., Journal of Physics: Conference Series 2019 Vol. 1163 No. 1 P. 1-6

In this paper we apply multifractal formalism to the analysis of the statistical behaviour of topic models under variation of the number of topics. Fractal analysis of topic models makes it possible to show that self-similar fractal clusters exist in large textual collections. We provide numerical results for 3 topic models (PLSA, ARTM, LDA Gibbs sampling) on 2 datasets, ...

Added: November 30, 2018

Residual Empirical Processes and Qualitatively Robust GM-Tests in Autoregression

Boldin M. V., Esaulov D., Vestnik Moskovskogo Universiteta. Seriya 1: Matematika. Mekhanika 2014 No. 1 P. 46-50

The local qualitative robustness of GM-tests against outliers in the autoregression model is studied in the paper. A local scheme of data contamination by independent outliers with the intensity O(n^{-1/2}) is considered. The qualitative robustness in terms of power equicontinuity is obtained. The GM-tests asymptotically optimal in the maximin sense are constructed. ...

Added: October 17, 2016

Estimating Topic Modeling Performance with Sharma–Mittal Entropy

Koltsov S., Ignatenko V., Koltsova O., Entropy 2019 Vol. 21 No. 7 P. 1-29

Topic modeling is a popular approach for clustering text documents. However, current tools have a number of unsolved problems such as instability and a lack of criteria for selecting the values of model parameters. In this work, we propose a method to partially solve the problems of optimizing model parameters, simultaneously accounting for semantic stability. ...

Added: July 5, 2019