Modifications of the EM Algorithm for Probabilistic Topic Modeling
Probabilistic topic models discover a low-dimensional interpretable representation of text corpora
by estimating a multinomial distribution over topics for each document and a multinomial
distribution over terms for each topic. A unified family of expectation-maximization (EM) like
algorithms with smoothing, sampling, sparsing, and robustness heuristics that can be used in
any combination is considered. The known models PLSA (probabilistic latent semantic analysis),
LDA (latent Dirichlet allocation), and SWB (special words with background), as well as new
ones, can be considered special cases of the presented broad family of models. A new simple robust
algorithm suitable for sparse models, which does not require estimating and storing a large matrix
of noise parameters, is proposed. The authors experimentally find optimal combinations
of heuristics with sparsing strategies and discover that the sparse robust model without Dirichlet
smoothing performs very well, yielding more than 99% zeros in the multinomial distributions
without loss of perplexity.
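To make the shared core of this algorithm family concrete, the following is a minimal sketch of the rationalized EM iteration for plain PLSA (the base case without smoothing, sampling, or sparsing heuristics) on a term-document count matrix. The function name, argument names, and NumPy-based implementation are illustrative assumptions, not the authors' code; phi holds the term-topic distributions p(w|t) and theta the topic-document distributions p(t|d).

```python
import numpy as np

def plsa_em(N, T, iters=50, seed=0):
    """Rationalized EM for PLSA on a W x D term-document count matrix N.

    Illustrative sketch only. Returns:
      phi   -- W x T matrix, columns are term distributions p(w|t)
      theta -- T x D matrix, columns are topic distributions p(t|d)
    """
    rng = np.random.default_rng(seed)
    W, D = N.shape
    phi = rng.random((W, T))
    phi /= phi.sum(axis=0)            # normalize each topic: p(w|t)
    theta = rng.random((T, D))
    theta /= theta.sum(axis=0)        # normalize each document: p(t|d)
    for _ in range(iters):
        # E-step (implicit): model probabilities p(w|d) = sum_t phi_wt theta_td
        Z = phi @ theta
        Z[Z == 0] = 1e-12             # guard against division by zero
        R = N / Z                     # ratios n_dw / p(w|d)
        # M-step: aggregate expected topic counts
        #   n_wt = phi_wt * sum_d theta_td * R_wd
        #   n_td = theta_td * sum_w phi_wt * R_wd
        n_wt = phi * (R @ theta.T)
        n_td = theta * (phi.T @ R)
        phi = n_wt / n_wt.sum(axis=0)
        theta = n_td / n_td.sum(axis=0)
    return phi, theta
```

The heuristics discussed in the paper slot into this loop: Dirichlet smoothing adds pseudocounts to n_wt and n_td before normalization, while sparsing zeroes small entries of phi and theta and renormalizes.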