Topic models with elements of neural networks: investigation of stability, coherence, and determining the optimal number of topics
Topic modeling is a widely used instrument for the analysis of large text collections.
In the last few years, neural topic models and models with word embeddings have
been proposed to increase the quality of topic solutions. However, these models
have not been extensively tested in terms of stability and interpretability. Moreover,
selecting the number of topics (a model parameter) remains a challenging
task. We aim to partially fill this gap by testing four well-known topic models that are
available to a wide range of users: the embedded topic model (ETM), the
Gaussian Softmax distribution model (GSM), Wasserstein autoencoders with Dirichlet
prior (W-LDA), and Wasserstein autoencoders with Gaussian Mixture prior (WTM-GMM).
We demonstrate that W-LDA, WTM-GMM, and GSM exhibit poor stability,
which complicates their application in practice. The ETM model with additionally trained
embeddings demonstrates high coherence and rather good stability for large datasets,
but the question of the number of topics remains unsolved for this model. We also
propose a new topic model based on granulated sampling with word embeddings
(GLDAW), which demonstrates the highest stability and good coherence among the
considered models. Moreover, the optimal number of topics in a dataset can
be determined for this model.