Measuring Topic Quality in Latent Dirichlet Allocation
Topic modeling is an important direction of study for modern text mining; unsupervised mining of collections of topics is intended to produce understanding and capture the essence of issues a dataset is devoted to. However, existing techniques of topic evaluation in topic models such as latent Dirichlet allocation (LDA) are still lacking in their ability to represent human interpretability and worth for qualitative studies. In this work, we propose a novel topic quality metric that more closely corresponds to human judgement than existing ones. We support this claim with the results of an experimental study where test subjects rate LDA topics on how interpretable they are.
St. Petersburg: The Euler International Mathematical Institute, 2014
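The abstract above does not name the existing evaluation techniques it compares against. As one illustration of an automatic topic quality score of that general kind, the following is a minimal sketch of UMass-style coherence, which rewards topics whose top words co-occur in the same documents. It is not the metric proposed in the paper; the function name `umass_coherence`, the smoothing parameter `eps`, and the toy corpus are assumptions made for illustration.

```python
import math


def umass_coherence(top_words, documents, eps=1.0):
    """UMass-style coherence of one topic (higher means more coherent).

    top_words: the topic's words, ordered from most to least probable.
    documents: iterable of tokenized documents (lists of words).
    eps: smoothing constant that keeps the logarithm finite (assumed value).
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            d_l = doc_freq(top_words[l])
            if d_l == 0:
                continue  # word never occurs in the corpus; skip the pair
            d_ml = doc_freq(top_words[m], top_words[l])
            score += math.log((d_ml + eps) / d_l)
    return score


# Hypothetical toy corpus, for illustration only.
toy_docs = [["cat", "dog"], ["cat", "dog"], ["cat", "fish"], ["tax", "law"]]
coherent = umass_coherence(["cat", "dog"], toy_docs)
incoherent = umass_coherence(["cat", "tax"], toy_docs)
print(coherent > incoherent)
```

Scores of this kind are computed from co-occurrence counts alone, which is why their agreement with human interpretability ratings has to be checked experimentally.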
In: Internet Science, Proc. of 3rd Conf. INSCI 2016, Lecture Notes in Computer Science series, Vol. 9934. Switzerland: Springer, 2016. P. 176-188.
Topic modeling has emerged over the last decade as a powerful tool for analyzing large text corpora, including Web-based user-generated texts. Topic stability, however, remains a concern: topic models have a very complex optimization landscape with many local maxima, and even different runs of the same model yield very different topics. Aiming to add stability ...
Added: October 7, 2016
In: Knowledge Management in Organizations. 14th International Conference, KMO 2019, Zamora, Spain, July 15–18, 2019, Proceedings. Vol. 1027. Switzerland: Springer, 2019. P. 324-335.
The intellectual structure of an academic discipline can be viewed as a set of interacting topics evolving over time. The dynamics of those topics, i.e. changes in their popularity and impact, are the subject of special attention because they reflect a shift in researchers’ actual interests. This paper analyzes topics of knowledge management (KM) on the base ...
Added: June 14, 2019
In: Proceedings of the 12th Mexican International Conference on Artificial Intelligence (MICAI 2013). Part I: Advances in Artificial Intelligence and Its Applications. Berlin: Springer, 2013. P. 265-274.
An important text mining problem is to find, in a large collection of texts, documents related to specific topics and then discern further structure among the found texts. This problem is especially important for social sciences, where the purpose is to find the most representative documents for subsequent qualitative interpretation. To solve this problem, we propose ...
Added: October 10, 2014
Policy & Internet, 2013, Vol. 5, No. 2, P. 207-227
The purpose of this research is to describe the agenda set by the Internet-active part of the Russian public on Russia’s leading blog platform, LiveJournal. This is done through modelling LiveJournal’s topic structure, viewed as a reflection of online public opinion. Topic modelling is performed automatically with an LDA algorithm, and complemented with hand ...
Added: December 11, 2012
Journal of Information Science, 2017, Vol. 43, No. 1, P. 88-102
Qualitative studies, such as sociological research, opinion analysis and media studies, can benefit greatly from automated topic mining provided by topic models such as latent Dirichlet allocation (LDA). However, examples of qualitative studies that employ topic modelling as a tool are currently few and far between. In this work, we identify two important problems along ...
Added: October 7, 2016
In: Proceedings of WebSci '14 ACM Web Science Conference, Bloomington, IN, USA, June 23-26, 2014. NY: ACM, 2014. P. 161-165.
Topic modeling, in particular the Latent Dirichlet Allocation (LDA) model, has recently emerged as an important tool for understanding large datasets, in particular, user-generated datasets in social studies of the Web. In this work, we investigate the instability of LDA inference, propose a new metric of similarity between topics and a criterion of vocabulary reduction. ...
Added: October 17, 2014
Changes in the Topical Structure of Russian-Language LiveJournal: The Impact of Elections 2011 / Higher School of Economics. Series SOC "Sociology". 2013. No. 14.
This study investigates the topical structure of the Russian-language blog-publishing service LiveJournal and the change in it that occurred in the course of the public activity after the State Duma elections in December 2011 as compared to a previous “control” period (November 27 – December 27 and August 15 – September 15 respectively). The data ...
Added: February 1, 2013
In: Analysis of Images, Social Networks and Texts. 9th International Conference, AIST 2020, Skolkovo, Moscow, Russia, October 15–16, 2020, Revised Selected Papers (LNCS, Vol. 12602). Springer Publishing Company, 2021. P. 69-81.
The paper examines the efficiency of topic models as features for computational identification and conceptual analysis of linguistic metaphor on Russian data. We train topic models using three algorithms (LDA and ARTM – sparse and dense) and evaluate their quality. We compute topic vectors for sentences of a metaphor-annotated Russian corpus and train several classifiers ...
Added: October 7, 2020
A Study of the Discourse on Biopedagogy Using Topic Modeling and Syntactic Text Analysis
In: Mathematical and Computer Modeling [Electronic resource]: Proceedings of the IV International Scientific Conference (Omsk, November 11, 2016). Omsk: Omsk State University Press, 2016. P. 154-156.
Drawing on the healthy lifestyle section of the blog lifehacker.ru, this work uses topic modeling and syntactic text analysis to investigate how the discourse on biopedagogy manifests itself on the Internet, which linguistic means are used for this, and which topics are addressed. ...
Added: November 25, 2016
In: ASE '20: Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering. ACM, 2020. P. 1316-1320.
Added: October 26, 2021
Proceedings, 2020, Vol. 46, No. 1, P. 1-8
In practice, the critical step in building machine learning models of big data (BD) is parameter tuning with a grid search, a procedure that is costly in terms of time and computing resources. Due to their size, BD are comparable to mesoscopic physical systems. Hence, methods of statistical physics could be applied to BD. The paper ...
Added: March 12, 2020
In: Digital Transformation and Global Society. Third International Conference, DTGS 2018, St. Petersburg, Russia, May 30 - June 2, 2018, Revised Selected Papers, Part I. Issue 858. Cham: Springer, 2018. P. 181-194.
Internet regulation in Russia has vigorously expanded in recent years to transform the relatively free communication environment of the 2000s into a heavily regulated one. Our goal was to identify the topic structure of Russian media discourse on Internet regulation and compare it between political and non-political media outlets. We used structural topic modeling on ...
Added: October 10, 2018
In: Proceedings of WebSci '14 ACM Web Science Conference, Bloomington, IN, USA, June 23-26, 2014. NY: ACM, 2014. P. 166-170.
In this paper we describe structural and topical properties of "ordinary" blogs versus "popular" blogs. Using the complete directory of the Russian language LiveJournal, we sample both groups and show that the main difference between them is in the volume of posting activity and of commenting feedback and in the skewedness of respective distributions. No ...
Added: October 8, 2014
In: Digital Transformation & Global Society: Second International Conference, DTGS 2017, St. Petersburg, Russia, June 21-23, 2017, Revised Selected Papers. Springer, 2017. P. 341-346.
In this work in progress, we analyze how perceived hotel value dimensions and the perception of city sights are connected with categories of hotels. Applying a topic modelling algorithm to 21,165 reviews from 201 hotels located in Saint Petersburg, we show that clients of hotels of different categories pay attention to different value dimensions. Analyzing ...
Added: December 2, 2017
Physica A: Statistical Mechanics and its Applications, 2018, Vol. 512, P. 1192-1204
This study proposes to minimize Rényi and Tsallis entropies for finding the optimal number of topics T in topic modeling (TM). A promising tool to obtain knowledge about large text collections, TM is a method whose properties are underresearched; in particular, parameter optimization in such models has been hindered by the use of monotonous quality ...
Added: October 11, 2018
In: Lecture Notes in Computer Science. Vol. 8852: SocInfo 2014 International Workshops, Barcelona, Spain, November 11, 2014, Revised Selected Papers. NY: Springer, 2015. P. 52-55.
In this paper we explore the main patterns of communication and cooperation in online groups created by residents of apartment buildings in St. Petersburg on the social networking site “VK”. Using word-frequency analysis and Latent Dirichlet Allocation (LDA), we discovered the main discussion topics in online groups. We have also found that communication of neighbors in these groups is ...
Added: November 7, 2014
Computacion y Sistemas, 2016, Vol. 20, No. 3, P. 387-403
Social studies of the Internet have adopted large-scale text mining for unsupervised discovery of topics related to specific subjects. A recently developed approach to topic modeling, additive regularization of topic models (ARTM), provides fast inference and more control over the topics with a wide variety of possible regularizers than developing LDA extensions. We apply ARTM ...
Added: November 17, 2016
Who’s Bad? Attitudes Toward Resettlers From the Post-Soviet South Versus Other Nations in the Russian Blogosphere
International Journal of Communication, 2017, Vol. 11, P. 3242-3264
Communication in social media is increasingly being found to reproduce or even reinforce ethnic prejudice and hostility toward migrants. In Russia of the 2010s, which has the world’s second-largest immigrant population, polls have detected high levels of hostility of the Russian population toward migranty (migrants), a label attached to resettlers from Central Asia and the ...
Added: October 4, 2017
Machine Learning and Data Analysis, 2013, Vol. 1, No. 6, P. 657-686
Probabilistic topic models discover a low-dimensional interpretable representation of text corpora by estimating a multinomial distribution over topics for each document and a multinomial distribution over terms for each topic. A unified family of expectation-maximization (EM) like algorithms with smoothing, sampling, sparsing, and robustness heuristics that can be used in any combination is considered. The ...
Added: February 19, 2015
In: WebSci 2016 - Proceedings of the 2016 ACM Web Science Conference. Elsevier, 2016. P. 342-343.
Topic modeling is a powerful tool for analyzing large collections of user-generated web content, but it still suffers from problems with topic stability, which are especially important for social sciences. We evaluate stability for different topic models and propose a new model, granulated LDA, that samples short sequences of neighboring words at once. We show that gLDA ...
Added: October 24, 2016
In: Digital Transformation & Global Society: Second International Conference, DTGS 2017, St. Petersburg, Russia, June 21-23, 2017, Revised Selected Papers. Springer, 2017. P. 113-119.
In this paper, we analyse the strategies and stratification of Russian universities in the Northwestern region. By enriching traditional social network analysis scientometric tools, we developed web presence indicators focused on the contexts in which universities are linked with businesses and are mentioned in media. We treat resulting groups in terms of Gouldner’s cosmopolitans versus ...
Added: December 11, 2017
Analyzing the Influence of Hyper-parameters and Regularizers of Topic modeling in Terms of Renyi Entropy
Physica A: Statistical Mechanics and its Applications, 2019
Topic modeling is a popular approach for clustering text documents. A variety of different types of regularization is implemented in topic modeling. In this paper, we propose a novel approach for analyzing the influence of different regularization types on results of topic modeling. Based on Renyi entropy, this approach is inspired by the concepts from ...
Added: October 31, 2019
The Journal of Sociology and Social Anthropology, 2020, Vol. 23, No. 2, P. 130-165
The study presents an attempt at a complex exploratory analysis of Russian rap based on a corpus of texts of Russian-language songs of this genre. The corpus contains more than 11,000 texts by more than 500 artists, varying in date of creation and popularity, collected by automatically extracting data from web ...
Added: August 12, 2020