Analyzing the Influence of Hyper-parameters and Regularizers of Topic Modeling in Terms of Renyi entropy

Koltsov S., Ignatenko V., Boukhers Z., Staab S.

doi:10.3390/e22040394

Entropy. 2020. Vol. 22. No. 4. P. 1–13.

Topic modeling is a popular technique for clustering large collections of text documents. Many different types of regularization are implemented in topic modeling. In this paper, we propose a novel approach for analyzing the influence of different regularization types on the results of topic modeling. Based on Renyi entropy, this approach is inspired by concepts from statistical physics, where the inferred topical structure of a collection can be considered an information statistical system residing in a non-equilibrium state. We test our approach on four models: Probabilistic Latent Semantic Analysis (pLSA), Additive Regularization of Topic Models (BigARTM), Latent Dirichlet Allocation (LDA) with Gibbs sampling, and LDA with variational inference (VLDA). First, we show that the minimum of Renyi entropy coincides with the “true” number of topics, as determined in two labelled collections. At the same time, we find that the Hierarchical Dirichlet Process (HDP) model, a well-known approach for topic number optimization, fails to detect this optimum. Next, we demonstrate that large values of the regularization coefficient in BigARTM significantly shift the minimum of entropy away from the topic number optimum, an effect that is not observed for hyper-parameters in LDA with Gibbs sampling. We conclude that regularization may introduce unpredictable distortions into topic models, which calls for further research.
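As a rough illustration of the quantity the abstract refers to, the sketch below computes the textbook Renyi entropy of order q for a discrete probability distribution. It is not the estimator used in the paper, which is defined over the output of a topic model and is not reproduced on this page; the function name `renyi_entropy` and the example distributions are assumptions made for illustration only.

```python
import math


def renyi_entropy(probs, q):
    """Textbook Renyi entropy of order q (natural log) of a discrete distribution.

    Hypothetical helper for illustration; not the paper's topic-model estimator.
    """
    if q <= 0:
        raise ValueError("q must be positive")
    p = [x for x in probs if x > 0]  # zero-probability outcomes contribute nothing
    if abs(q - 1.0) < 1e-12:
        # The limit q -> 1 recovers the Shannon entropy.
        return -sum(x * math.log(x) for x in p)
    return math.log(sum(x ** q for x in p)) / (1.0 - q)


# Assumed toy distributions: one flat, one concentrated on a single outcome.
uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.85, 0.05, 0.05, 0.05]

print(renyi_entropy(uniform, 2.0))  # equals ln(4) for a uniform distribution
print(renyi_entropy(peaked, 2.0))   # lower, because the mass is concentrated
```

A flat distribution gives the largest value and a concentrated one a smaller value, which is the general behaviour that entropy-minimization criteria rely on.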

Research target: Computer Science

Priority areas: IT and mathematics

Keywords: topic modeling; regularization; probabilistic topic modeling; Renyi entropy

Publication based on the results of:

Online communication: cognitive limits and methods of automatic analysis (2020)

Analyzing the Influence of Hyper-parameters and Regularizers of Topic modeling in Terms of Renyi Entropy

Ignatenko V., Koltsov S., Staab S. et al., Physica A: Statistical Mechanics and its Applications 2019

Topic modeling is a popular approach for clustering text documents. A variety of different types of regularization is implemented in topic modeling. In this paper, we propose a novel approach for analyzing the influence of different regularization types on results of topic modeling. Based on Renyi entropy, this approach is inspired by the concepts from ...

Added: October 31, 2019

Renormalization Analysis of Topic Models

Koltcov Sergei, Ignatenko V., Entropy 2020 Vol. 22 No. 5 P. 1–23

In practice, to build a machine learning model of big data, one needs to tune model parameters. The process of parameter tuning involves extremely time-consuming and computationally expensive grid search. However, the theory of statistical physics provides techniques allowing us to optimize this process. The paper shows that a function of the output of topic ...

Added: May 18, 2020

Analysis and tuning of hierarchical topic models based on Renyi entropy approach

Koltsov S., Ignatenko V., Terpilowski M. et al., PeerJ Computer Science 2021 Vol. 7 Article e608

Hierarchical topic modeling is a potentially powerful instrument for determining topical structures of text collections that additionally allows constructing a hierarchy representing the levels of topic abstractness. However, parameter optimization in hierarchical models, which includes finding an appropriate number of topics at each level of hierarchy, remains a challenging task. In this paper, we propose ...

Added: September 3, 2021

Topic models with elements of neural networks: investigation of stability, coherence, and determining the optimal number of topics

Koltsov S., Surkov A., Filippov V. et al., PeerJ Computer Science (USA) 2024 Vol. 10 P. 1–41

Topic modeling is a widely used instrument for the analysis of large text collections. In the last few years, neural topic models and models with word embeddings have been proposed to increase the quality of topic solutions. However, these models were not extensively tested in terms of stability and interpretability. Moreover, the question of selecting ...

Added: January 23, 2024

Mapping of Inaccessible Buildings by Radio Tomography

Ingacheva A., Kokhan V. V., Ershov E. et al., Сенсорные системы 2018 Vol. 32 No. 4 P. 332–341

In this paper, we consider the task of mapping the objects inside a building using a group of autonomous agents that move around it and emit narrow beams of radio waves at a WiFi frequency (2.4 GHz). A linear model of pixel-wise radio wave attenuation is considered. The SIRT algorithm with TV and Tikhonov regularizations is used for the ...

Added: February 9, 2020

On a regularization of the magnetic gas dynamics system of equations

Zlotnik Alexander, Ducomet B., Kinetic and Related Models 2013 Vol. 6 No. 3 P. 533–543

A brief derivation of a specific regularization for the magnetic gas dynamic system of equations is given in the case of general equations of gas state (in presence of a body force and a heat source). The entropy balance equation in two forms is also derived for the system. For a constant regularization parameter and ...

Added: September 27, 2013

Application of Rényi and Tsallis entropies to topic modeling optimization

Koltsov S., Physica A: Statistical Mechanics and its Applications 2018 Vol. 512 P. 1192–1204

This study proposes to minimize Rényi and Tsallis entropies for finding the optimal number of topics T in topic modeling (TM). A promising tool to obtain knowledge about large text collections, TM is a method whose properties are underresearched; in particular, parameter optimization in such models has been hindered by the use of monotonous quality ...

Added: October 11, 2018

Topic modelling for qualitative studies

Sergey Nikolenko, Sergei Koltcov, Olessia Koltsova, Journal of Information Science 2017 Vol. 43 No. 1 P. 88–102

Qualitative studies, such as sociological research, opinion analysis and media studies, can benefit greatly from automated topic mining provided by topic models such as latent Dirichlet allocation (LDA). However, examples of qualitative studies that employ topic modelling as a tool are currently few and far between. In this work, we identify two important problems along ...

Added: October 7, 2016

Robust PLSA Performs Better Than LDA

Konstantin Vorontsov, Potapenko A., Lecture Notes in Computer Science 2013 Vol. 7814 P. 784–787

In this paper we introduce a generalized learning algorithm for probabilistic topic models (PTM). Many known and new algorithms for PLSA, LDA, and SWB models can be obtained as its special cases by choosing a subset of the following “options”: regularization, sampling, update frequency, sparsing and robustness. We show that a robust topic model, which ...

Added: November 13, 2013

Estimating Topic Modeling Performance with Sharma–Mittal Entropy

Koltsov S., Ignatenko V., Koltsova O., Entropy 2019 Vol. 21 No. 7 P. 1–29

Topic modeling is a popular approach for clustering text documents. However, current tools have a number of unsolved problems such as instability and a lack of criteria for selecting the values of model parameters. In this work, we propose a method to solve partially the problems of optimizing model parameters, simultaneously accounting for semantic stability. ...

Added: July 5, 2019

Topic models with elements of neural networks: investigation of stability, coherence, and determining the optimal number of topics

Sergei Koltcov, Surkov A., Filippov V. et al., PeerJ Computer Science 2024 Vol. 10 P. 41

Added: February 16, 2024

Additive Regularization of Topic Models

Vorontsov K. V., Potapenko A. A., Доклады Академии наук 2014 Vol. 456 No. 3 P. 268–271

Probabilistic topic modeling of text document collections is currently being developed mainly within the framework of the Bayesian approach and graphical models. This paper proposes an alternative approach that is free of redundant probabilistic assumptions. Additive regularization of topic models (ARTM) is based on maximizing a weighted sum of the log-likelihood and additional regularization criteria. This simplifies combining topic models and building arbitrarily complex multi-objective models. ...
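For orientation, the ARTM objective described in this abstract is usually written as the weighted sum below. The notation is the conventional one and is an assumption here, since the abstract itself gives no symbols: $n_{dw}$ is the count of word $w$ in document $d$, $\phi_{wt}$ and $\theta_{td}$ are the word-topic and topic-document probabilities, $R_i$ are the regularizers, and $\tau_i$ are their coefficients.

```latex
\sum_{d \in D} \sum_{w \in d} n_{dw} \ln \sum_{t \in T} \phi_{wt}\,\theta_{td}
\;+\; \sum_{i=1}^{k} \tau_i R_i(\Phi, \Theta)
\;\to\; \max_{\Phi,\,\Theta}
```

With all $\tau_i = 0$ the expression reduces to the plain log-likelihood, so the coefficients $\tau_i$ control how strongly each regularizer pulls the solution away from the unregularized model.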

Added: December 5, 2014

On some properties of multidimensional hyperbolic quasi-gasdynamic systems of equations

Chetverushkin B. N., Zlotnik A.A., Russian Journal of Mathematical Physics 2017 Vol. 24 No. 3 P. 299–309

We study a multidimensional hyperbolic quasi-gasdynamic (HQGD) system of equations containing terms with a regularizing parameter $\tau>0$ and 2nd order space and time derivatives; the body force is taken into account. We transform it to the form close to the compressible Navier-Stokes system of equations. Then we derive the entropy balance equation and show that ...

Added: July 19, 2017

Approximate Estimation Using the Accelerated Maximum Entropy Method. Part 2. Study of the Properties of the Estimates

Dubnov Y. A., Bulychev A., Информационные технологии и вычислительные системы 2023 No. 1 P. 71–81

In this paper, we investigate a method of approximate entropy estimation, designed to speed up the classical method of maximum entropy estimation by using regularization in the optimization problem. This method is compared with the method of maximum likelihood and Bayesian estimation, both experimentally and in terms of theoretical calculations for some ...

Added: June 16, 2023

Computer Vision – ECCV 2018. 15th European Conference, Munich, Germany, September 8–14, 2018, Proceedings, Part XII

Cham: Springer, 2018.

The sixteen-volume set comprising the LNCS volumes 11205-11220 constitutes the refereed proceedings of the 15th European Conference on Computer Vision, ECCV 2018, held in Munich, Germany, in September 2018. The 776 revised papers presented were carefully reviewed and selected from 2439 submissions. The papers are organized in topical sections on learning for vision; computational photography; ...

Added: October 31, 2018

(1 + ε)-class Classification: an Anomaly Detection Method for Highly Imbalanced or Incomplete Data Sets

Maxim Borisyak, Artem Ryzhikov, Ustyuzhanin A. et al., Journal of Machine Learning Research 2020 Vol. 21 P. 1–22

Anomaly detection is not an easy problem since the distribution of anomalous samples is unknown a priori. We explore a novel method that offers a trade-off between one-class and two-class approaches and leads to better performance on anomaly detection problems with small or non-representative anomalous samples. The method is evaluated using several data sets ...

Added: March 13, 2020

On a regularization of the magnetic gas dynamics system of equations

Ducomet B., Zlotnik A., arXiv.org (math) 2012 No. arXiv:1211.3539 [math.AP]

Added: January 25, 2013

Proceedings of the 4th Global TechMining Conference (Leiden, Netherlands)

Leiden: [s.n.], 2014.

The goal of the conference is to help build cross-disciplinary networks of analysts, software specialists, and researchers to advance the use of textual information in multiple science, technology, and business development fields. Within this context, conference themes will include, but are not limited to: Data Sourcing, preparing, and interpreting data sources including patents, publications, webscraping, and other ...

Added: October 23, 2014

Fast Tuning of Topic Models: An Application of Rényi Entropy and Renormalization Theory

Sergei Koltcov, Ignatenko V., Pashakhin S., Proceedings 2020 Vol. 46 No. 1 P. 1–8

In practice, the critical step in building machine learning models of big data (BD) is parameter tuning with a grid search, a procedure that is costly in terms of time and computing resources. Owing to their size, BD are comparable to mesoscopic physical systems. Hence, methods of statistical physics could be applied to BD. The paper ...

Added: March 12, 2020

Fractal approach for determining the optimal number of topics in the field of topic modeling

Ignatenko V., Sergei Koltcov, Staab S. et al., Journal of Physics: Conference Series 2019 Vol. 1163 No. 1 P. 1–6

In this paper, we apply multifractal formalism to the analysis of the statistical behaviour of topic models under variation of the number of topics. Fractal analysis of topic models allows us to show that self-similar fractal clusters exist in large textual collections. We provide numerical results for 3 topic models (PLSA, ARTM, LDA Gibbs sampling) on 2 datasets, ...

Added: November 30, 2018

Development of an Expert Search Service for Current Information Events

Karpov N., Shadrina E. V., Алгоритмы, методы и системы обработки данных 2015 No. 4(33) P. 33–47

In this paper, we propose a new way to develop a service for sharing knowledge in the university cluster by searching for appropriate experts. The method is based on a modern approach to the search for experts with the help of topic modeling. The service has been implemented in the form of a decision support ...

Added: February 4, 2016

Probably approximately correct learning of Horn envelopes from queries

Borchmann D., Hanika T., Obiedkov S., Discrete Applied Mathematics 2020 Vol. 273 P. 30–42

We propose an algorithm for learning the Horn envelope of an arbitrary domain using an expert, or an oracle, capable of answering certain types of queries about this domain. Attribute exploration from formal concept analysis is a procedure that solves this problem, but the number of queries it may ask is exponential in the size ...

Added: October 29, 2019

Proceedings of 11th Industrial Conference on Data Mining (ICDM 2012)

Springer, 2012.

Added: January 29, 2013

Modeling Networks-on-Chip Based on Regular and Quasi-Optimal Topologies Using the OCNS Simulator

Romanov A., Tumkovskiy S., Ivanova G. A., Вестник РГРТУ 2015 Vol. 2 No. 52 P. 61–66

A review of networks-on-chip modeling methods is given. A high-level model of networks-on-chip is developed in the Java programming language; it accelerates the modeling process by several orders of magnitude compared to HDL models. The results of simulation of networks-on-chip based on regular and quasi-optimal topologies with the number of nodes up to 100 ...

Added: June 21, 2015