Topic modelling for qualitative studies

Sergey Nikolenko; Sergei Koltcov; Olessia Koltsova

doi:10.1177/0165551515617393

Journal of Information Science. 2017. Vol. 43. No. 1. P. 88-102.

Qualitative studies, such as sociological research, opinion analysis and media studies, can benefit greatly from automated topic mining provided by topic models such as latent Dirichlet allocation (LDA). However, qualitative studies that employ topic modelling as a tool are currently few and far between. In this work, we identify two important obstacles to using topic models in qualitative studies: the lack of a good quality metric that closely matches human judgement in understanding topics, and the need to indicate the specific subtopics that a given qualitative study is most interested in mining. For the first problem, we propose a new quality metric, tf-idf coherence, that reflects human judgement more accurately than regular coherence, and conduct an experiment to verify this claim. For the second problem, we propose an interval semi-supervised approach (ISLDA) where certain predefined sets of keywords (that define the topics researchers are interested in) are restricted to specific intervals of topic assignments. Our experiments show that ISLDA is better for topic extraction than LDA in terms of tf-idf coherence, the number of topics matched to predefined keywords, and topic stability. We also present a case study on a Russian LiveJournal dataset aimed at ethnicity discourse analysis.
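
The interval restriction described in the abstract can be illustrated with a small sketch. The abstract does not give the exact ISLDA formulation, so the following is only one plausible reading: it assumes a collapsed Gibbs sampler for LDA and half-open topic intervals, and the function name `restricted_topic_probs`, the `keyword_intervals` mapping and the toy counts are all hypothetical. For a keyword, the usual conditional topic weights are zeroed outside its allowed interval before normalisation; other words are sampled as in plain LDA.

```python
def restricted_topic_probs(word, doc_topic, topic_word, topic_total,
                           keyword_intervals, alpha=0.1, beta=0.01):
    """Conditional topic distribution for one token (collapsed Gibbs LDA),
    with keywords confined to an allowed interval of topic indices.

    Assumption: intervals are half-open [lo, hi); this is an illustrative
    choice, not taken from the paper.
    """
    num_topics = len(doc_topic)
    vocab_size = len(topic_word[0])
    # Standard collapsed Gibbs weights:
    # (n_dk + alpha) * (n_kw + beta) / (n_k + V * beta)
    weights = [
        (doc_topic[k] + alpha) * (topic_word[k][word] + beta)
        / (topic_total[k] + vocab_size * beta)
        for k in range(num_topics)
    ]
    if word in keyword_intervals:
        lo, hi = keyword_intervals[word]
        # Zero out every topic outside the keyword's interval.
        weights = [w if lo <= k < hi else 0.0 for k, w in enumerate(weights)]
    total = sum(weights)
    return [w / total for w in weights]


# Toy counts (hypothetical): 4 topics, 3 vocabulary words.
doc_topic = [3, 1, 0, 2]                     # tokens per topic in this document
topic_word = [[2, 1, 0], [0, 4, 1], [1, 0, 3], [5, 0, 0]]
topic_total = [sum(row) for row in topic_word]
keyword_intervals = {0: (1, 3)}              # word 0 may only use topics 1 and 2

probs_keyword = restricted_topic_probs(0, doc_topic, topic_word, topic_total,
                                       keyword_intervals)
probs_free = restricted_topic_probs(1, doc_topic, topic_word, topic_total,
                                    keyword_intervals)
```

In this toy setup, word 0 would otherwise favour topic 3 (where it has the highest count), but the mask forces its probability mass onto topics 1 and 2, which is the sense in which predefined keywords steer particular topics towards the researcher's subject of interest.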

Research target: Computer Science

Priority areas: IT and mathematics

Keywords: topic modeling, LDA, latent Dirichlet allocation, topic quality

Publication based on the results of:

Development of a concept and methodology for multi-level monitoring of the state of interethnic relations based on social network data (2017)

Development of an expert search service for current information events

Karpov N., Shadrina E. V., Алгоритмы, методы и системы обработки данных 2015 No. 4(33) P. 33-47

In this paper, we propose a new way to develop a service for sharing knowledge in the university cluster by searching for appropriate experts. The method is based on a modern approach to the search for experts with the help of topic modeling. The service has been implemented in the form of a decision support ...

Added: February 4, 2016

Stable Topic Modeling with Local Density Regularization

Sergei Koltcov, Nikolenko S. I., Olessia Koltsova et al., in: Internet Science, Proc. of 3rd conf. INSCI 2016, Lecture Notes in Computer Science series. Vol. 9934. Switzerland: Springer, 2016. P. 176-188.

Topic modeling has emerged over the last decade as a powerful tool for analyzing large text corpora, including Web-based user-generated texts. Topic stability, however, remains a concern: topic models have a very complex optimization landscape with many local maxima, and even different runs of the same model yield very different topics. Aiming to add stability ...

Added: October 7, 2016

Do topics make a metaphor? Topic modeling for metaphor identification and analysis in Russian.

Badryzlova Y., Nikiforova A., Lyashevskaya O., in: Analysis of Images, Social Networks and Texts: 9th International Conference, AIST 2020, Skolkovo, Moscow, Russia, October 15–16, 2020, Revised Selected Papers. Vol. 12602. Springer, 2021. P. 69-81.

The paper examines the efficiency of topic models as features for computational identification and conceptual analysis of linguistic metaphor on Russian data. We train topic models using three algorithms (LDA and ARTM – sparse and dense) and evaluate their quality. We compute topic vectors for sentences of a metaphor-annotated Russian corpus and train several classifiers ...

Added: October 7, 2020

Robust PLSA Performs Better Than LDA

Konstantin Vorontsov, Potapenko A., Lecture Notes in Computer Science 2013 Vol. 7814 P. 784-787

In this paper we introduce a generalized learning algorithm for probabilistic topic models (PTM). Many known and new algorithms for PLSA, LDA, and SWB models can be obtained as its special cases by choosing a subset of the following “options”: regularization, sampling, update frequency, sparsing and robustness. We show that a robust topic model, which ...

Added: November 13, 2013

Proceedings of the 4th Global TechMining Conference (Leiden, Netherlands)

Leiden: [s.n.], 2014

The goal of the conference is to help build cross-disciplinary networks of analysts, software specialists, and researchers to advance the use of textual information in multiple science, technology, and business development fields. Within this context, conference themes will include, but are not limited to: Data Sourcing, preparing, and interpreting data sources including patents, publications, webscraping, and other ...

Added: October 23, 2014

Studying Patterns of Communication in Virtual Urban Groups With Different Modes of Privacy

Vadim Voskresenskiy, Musabirov I., Alexandrov D. A. / Higher School of Economics. Series SOC "Sociology". 2017.

This paper is concerned with online communication among residents of apartment buildings on the general-purpose social networking site (SNS) VKontakte (VK), focusing on how group participants use SNS instruments to separate place-based discussions from participation in wider community initiatives. With the help of the topic modeling algorithm LDA, we analyzed posts collected from online groups related ...

Added: October 20, 2017

Gibbs Sampler Optimization for Analysis of a Granulated Medium

Koltsov S., Nikolenko S. I., Koltsova O., Письма в Журнал технической физики 2016 Vol. 42 No. 8 P. 837-839

A new variant of the method of probability density distribution recovery for solving topical modeling problems is described. Disadvantages of the Gibbs sampling algorithm are considered, and a modified variant, called the “granulated sampling method,” is proposed. Based on the results of statistical modeling, it is shown that the proposed algorithm is characterized by higher stability as ...

Added: July 26, 2016

Analyzing the Influence of Hyper-parameters and Regularizers of Topic Modeling in Terms of Renyi entropy

Koltsov S., Ignatenko V., Boukhers Z. et al., Entropy 2020 Vol. 22 No. 4 P. 1-13

Topic modeling is a popular technique for clustering large collections of text documents. A variety of different types of regularization is implemented in topic modeling. In this paper, we propose a novel approach for analyzing the influence of different regularization types on results of topic modeling. Based on Renyi entropy, this approach is inspired by ...

Added: April 1, 2020

Renormalization Analysis of Topic Models

Koltcov Sergei, Ignatenko V., Entropy 2020 Vol. 22 No. 5 P. 1-23

In practice, to build a machine learning model of big data, one needs to tune model parameters. The process of parameter tuning involves extremely time-consuming and computationally expensive grid search. However, the theory of statistical physics provides techniques allowing us to optimize this process. The paper shows that a function of the output of topic ...

Added: May 18, 2020

Additive regularization of topic models

Vorontsov K. V., Potapenko A. A., Доклады Академии наук 2014 Vol. 456 No. 3 P. 268-271

Probabilistic topic modeling of text document collections is currently developing mainly within the Bayesian approach and graphical models. This paper proposes an alternative approach that is free of redundant probabilistic assumptions. Additive regularization of topic models (ARTM) is based on maximizing a weighted sum of the log-likelihood and additional regularization criteria. This simplifies combining topic models and building arbitrarily complex multi-objective models. ...

Added: December 5, 2014

Analyzing the Influence of Hyper-parameters and Regularizers of Topic modeling in Terms of Renyi Entropy

Ignatenko V., Koltsov S., Staab S. et al., Physica A: Statistical Mechanics and its Applications 2019

Topic modeling is a popular approach for clustering text documents. A variety of different types of regularization is implemented in topic modeling. In this paper, we propose a novel approach for analyzing the influence of different regularization types on results of topic modeling. Based on Renyi entropy, this approach is inspired by the concepts from ...

Added: October 31, 2019

Topic models with elements of neural networks: investigation of stability, coherence, and determining the optimal number of topics

Koltsov S., Surkov A., Filippov V. et al., PeerJ Computer Science (USA) 2024 Vol. 10 P. 1-41

Topic modeling is a widely used instrument for the analysis of large text collections. In the last few years, neural topic models and models with word embeddings have been proposed to increase the quality of topic solutions. However, these models were not extensively tested in terms of stability and interpretability. Moreover, the question of selecting ...

Added: January 23, 2024

Analysis and tuning of hierarchical topic models based on Renyi entropy approach

Koltsov S., Ignatenko V., Terpilowski M. et al., PeerJ Computer Science 2021 Vol. 7 Article e608

Hierarchical topic modeling is a potentially powerful instrument for determining topical structures of text collections that additionally allows constructing a hierarchy representing the levels of topic abstractness. However, parameter optimization in hierarchical models, which includes finding an appropriate number of topics at each level of hierarchy, remains a challenging task. In this paper, we propose ...

Added: September 3, 2021

Regularization, robustness and sparsity of probabilistic topic models

Vorontsov K. V., Potapenko A., Компьютерные исследования и моделирование 2012 Vol. 4 No. 4 P. 693-706

We propose a generalized probabilistic topic model of text corpora which can incorporate heuristics of Bayesian regularization, sampling, frequent parameter updates, and robustness in any combination. The well-known models PLSA, LDA, CVB0, SWB, and many others can be considered as special cases of the proposed broad family of models. We propose the robust PLSA model ...

Added: February 19, 2015

Application of Rényi and Tsallis entropies to topic modeling optimization

Koltsov S., Physica A: Statistical Mechanics and its Applications 2018 Vol. 512 P. 1192-1204

This study proposes to minimize Rényi and Tsallis entropies for finding the optimal number of topics T in topic modeling (TM). A promising tool to obtain knowledge about large text collections, TM is a method whose properties are underresearched; in particular, parameter optimization in such models has been hindered by the use of monotonous quality ...

Added: October 11, 2018

Modifications of the EM algorithm for probabilistic topic modeling

Vorontsov K. V., Potapenko A., Машинное обучение и анализ данных 2013 Vol. 1 No. 6 P. 657-686

Probabilistic topic models discover a low-dimensional interpretable representation of text corpora by estimating a multinomial distribution over topics for each document and a multinomial distribution over terms for each topic. A unified family of expectation-maximization (EM) like algorithms with smoothing, sampling, sparsing, and robustness heuristics that can be used in any combination is considered. The ...

Added: February 19, 2015

A study of biopedagogy discourse using topic modeling and syntactic analysis of texts

Nagornyy O. S., Mukhetdinova A. T., in: Математическое и компьютерное моделирование [electronic resource]: proceedings of the IV International Scientific Conference (Omsk, November 11, 2016). Omsk: Omsk State University Publishing House, 2016. P. 154-156.

Using materials from the healthy lifestyle section of the blog lifehacker.ru, this work applies topic modeling and syntactic analysis of texts to study how the discourse on biopedagogy manifests itself on the Internet, which linguistic means are used for this, and which topics are covered. ...

Added: November 25, 2016

Fast Tuning of Topic Models: An Application of Rényi Entropy and Renormalization Theory

Sergei Koltcov, Ignatenko V., Pashakhin S., Proceedings 2020 Vol. 46 No. 1 P. 1-8

In practice, the critical step in building machine learning models of big data (BD) is parameter tuning with a grid search, a procedure that is costly in terms of time and computing resources. Due to their size, BD are comparable to mesoscopic physical systems. Hence, methods of statistical physics could be applied to BD. The paper ...

Added: March 12, 2020

Topic modeling for short texts: a comparative analysis

Vashchenko V., Социология: методология, методы, математическое моделирование 2023 No. 56 P. 1-20

The steady increase in the popularity of social media as a means of communication actualizes methodological issues related to processing of short texts with less semantic context than large corpora, which are widely used for training and testing machine learning models for textual data. Topic modeling, an unsupervised machine learning technique aimed at aggregating texts ...

Added: December 7, 2023

Modifications of the EM algorithm for probabilistic topic modeling

Vorontsov K. V., Potapenko A. A., Машинное обучение и анализ данных 2013 Vol. 1 No. 6 P. 657-686

Added: May 6, 2014

Particle Simulation for Predicting Effective Properties of Short Fiber Reinforced Composites

Skoptsov K. A., Sheshenin S., Galatenko V. V. et al., International Journal of Applied Mechanics 2016 Vol. 8 No. 2 P. 1650016-01-1650016-18

We present a method for evaluating elastic properties of a composite material produced by molding a resin filled with short elastic fibers. A flow of the filled resin is simulated numerically using a mesh-free method. After that, assuming that spatial distribution and orientation of fibers are not significantly changed during polymerization, effective elastic moduli of ...

Added: May 22, 2016

Computational Linguistics and Intellectual Technologies: Based on the Materials of the Annual International Conference "Dialogue" (Moscow, May 29 to June 1, 2019). Issue 18 (25)

Moscow: Publishing Center of the Russian State University for the Humanities, 2019

The collection includes 27 papers from the international conference on computational linguistics and intellectual technologies "Dialogue 2019" that were not included in the yearbook "Computational Linguistics and Intellectual Technologies" but were recommended by the Program Committee for presentation at the conference. Intended for specialists in theoretical and applied linguistics and intellectual technologies. ...

Added: December 10, 2019

Algorithms and methods for solving scheduling problems and other extremum problems on large-scale graphs

Chernyshev S. V., Cherepanov E. A., Pankratiev E. V. et al., Journal of Mathematical Sciences 2005 Vol. 128 No. 6 P. 3487-3495

Added: January 27, 2014

A software package for modeling physical processes in the computer-aided design of secondary power supplies for complex on-board systems

Sotnikova S., Динамика сложных систем 2012 No. 3 P. 84-87

The article describes a software package for modeling physical processes that also supports identification of the parameters of a printed circuit assembly (the physical model). The designed on-board secondary power supply is implemented on a printed circuit assembly. Interfaces were developed for it that connect the control program with a known modeling and optimization program. ...

Added: December 5, 2014