Topic Modeling for Short Texts: A Comparative Analysis
The steady growth of social media as a means of communication raises methodological questions about processing short texts, which carry less semantic context than the large corpora commonly used to train and test machine learning models for textual data. Topic modeling, an unsupervised machine learning technique that groups texts into topic clusters, has many academic and practical applications in which the true groupings of texts are unknown. However, topic modeling algorithms require sufficient semantic context to build a high-quality numerical representation of a unit of text, and a short document may not provide it, which can limit their performance. This paper discusses three approaches to topic modeling: classical LDA enriched with pre-trained word embeddings, topic modeling based on the BERT transformer model, and a network-based approach using stochastic blockmodels. We compare these algorithms on a set of Russian-language TikTok comments and formally evaluate them by speed and by the coherence of the resulting topics.