Regularization, robustness, and sparsity of probabilistic topic models
We propose a generalized probabilistic topic model of text corpora that can incorporate heuristics of Bayesian regularization, sampling, frequent parameter updates, and robustness in any combination. Well-known models such as PLSA, LDA, CVB0, SWB, and many others can be considered special cases of the proposed broad family of models. We propose the robust PLSA model and show that it is sparser and performs better than regularized models such as LDA.
We study the problem of testing composite hypotheses versus composite alternatives when there is a slight deviation between the model and the real distribution. Our approach, which we call sub-optimal testing, extends the initial model and modifies a sequential statistical test for the new model. The sub-optimal test is proposed and a non-asymptotic bound for the loss function is obtained. We also investigate the relation between the sub-optimal test and the sequential probability ratio test for the initial model.
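For context, the classical baseline mentioned in the abstract is Wald's sequential probability ratio test (SPRT). The sketch below is a minimal generic SPRT, not the paper's sub-optimal test; the function and parameter names are illustrative assumptions.

```python
import math

def sprt(samples, logpdf0, logpdf1, alpha=0.05, beta=0.05):
    """Wald's SPRT (sketch, not the paper's sub-optimal test):
    accumulate the log-likelihood ratio over observations until it
    crosses the upper threshold (accept H1) or the lower one (accept H0)."""
    A = math.log((1 - beta) / alpha)   # accept H1 when llr >= A
    B = math.log(beta / (1 - alpha))   # accept H0 when llr <= B
    llr = 0.0
    for n, x in enumerate(samples, 1):
        llr += logpdf1(x) - logpdf0(x)
        if llr >= A:
            return "H1", n
        if llr <= B:
            return "H0", n
    return "continue", len(samples)   # no decision yet

# Usage: Bernoulli hypotheses H0: p=0.1 vs H1: p=0.9 (toy example)
logp0 = lambda x: math.log(0.9 if x == 0 else 0.1)
logp1 = lambda x: math.log(0.1 if x == 0 else 0.9)
decision, n = sprt([1, 1, 1, 1, 1], logp0, logp1)
```

Each observation of `1` adds log(0.9/0.1) ≈ 2.2 to the ratio, so H1 is accepted after only a couple of samples.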
Proceedings of the III International Conference in memory of V.I. Zubov "Stability and Control Processes (SCP 2015)".
An important text mining problem is to find, in a large collection of texts, documents related to specific topics and then discern further structure among the found texts. This problem is especially important for social sciences, where the purpose is to find the most representative documents for subsequent qualitative interpretation. To solve this problem, we propose an interval semi-supervised LDA approach, in which certain predefined sets of keywords (that define the topics researchers are interested in) are restricted to specific intervals of topic assignments. We present a case study on a Russian LiveJournal dataset aimed at ethnicity discourse analysis.
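The core of the interval semi-supervised constraint can be sketched as follows: during sampling, a seed keyword's topic assignment is restricted to its predefined interval, while ordinary words range over all topics. All names (`seed_intervals`, `sample_topic`) and the example keyword sets are illustrative assumptions, not taken from the paper.

```python
import random

K = 20  # total number of topics (assumed)

# Researcher-defined keyword sets pinned to a topic interval
# (toy example for ethnicity discourse analysis).
seed_intervals = {
    "ethnicity": range(0, 3),  # topics 0-2 reserved for these keywords
    "migration": range(0, 3),
}

def allowed_topics(word):
    """Topics from which this word's assignment may be sampled."""
    return seed_intervals.get(word, range(K))

def sample_topic(word, weights, rng=random):
    """Sample a topic for `word` from the unnormalized Gibbs conditional
    `weights` (length K), zeroing out probability outside the interval."""
    allowed = list(allowed_topics(word))
    return rng.choices(allowed, weights=[weights[k] for k in allowed])[0]
```

A full Gibbs sampler would call `sample_topic` in place of the unconstrained topic draw; the restriction is the only change relative to plain LDA.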
Topic modeling, in particular the Latent Dirichlet Allocation (LDA) model, has recently emerged as an important tool for understanding large datasets, in particular user-generated datasets in social studies of the Web. In this work, we investigate the instability of LDA inference, propose a new metric of similarity between topics and a criterion for vocabulary reduction. We show the limitations of the LDA approach for the purposes of qualitative analysis in social science and sketch some ways for improvement.
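The paper's own similarity metric is not reproduced here; a common baseline for comparing topics across LDA runs, which the instability analysis could be contrasted with, is Jaccard overlap of the topics' top-word sets. A minimal sketch, with illustrative names:

```python
def top_words(topic_word_probs, vocab, n=10):
    """Return the set of the n highest-probability words of a topic."""
    ranked = sorted(range(len(vocab)), key=lambda i: -topic_word_probs[i])
    return {vocab[i] for i in ranked[:n]}

def topic_similarity(p, q, vocab, n=10):
    """Jaccard overlap of two topics' top-n word sets, in [0, 1]:
    1.0 means identical top words, 0.0 means disjoint."""
    a, b = top_words(p, vocab, n), top_words(q, vocab, n)
    return len(a & b) / len(a | b)

# Usage: two topics over a tiny vocabulary
vocab = ["a", "b", "c", "d"]
p = [0.4, 0.3, 0.2, 0.1]
q = [0.1, 0.2, 0.3, 0.4]
```

Matching topics between two runs by maximum similarity and averaging the matched scores gives one simple measure of inference stability.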
We present the robustness of the firm as an uninterrupted exchange of resources between the firm and the owners of resources, its stakeholders. We derive the model from the mutually accepted conditions of exchange for the major resources and indicate the limits within which the firm can manipulate the exchange conditions. We also argue that temporary benevolent behavior of the firm towards one or several of its stakeholders leads to accumulation of stakeholders' quasi-rent and contributes to the overall robustness of the firm.
In this paper we introduce a generalized learning algorithm for probabilistic topic models (PTM). Many known and new algorithms for the PLSA, LDA, and SWB models can be obtained as its special cases by choosing a subset of the following “options”: regularization, sampling, update frequency, sparsing, and robustness. We show that a robust topic model, which distinguishes specific, background, and topic terms, does not need Dirichlet regularization and provides a controllably sparse solution.
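The base case of this family, plain PLSA fitted by EM, can be sketched as below; the generalized algorithm's “options” (regularization, sparsing, robustness) would plug into the M-step re-estimation. This is a minimal illustrative implementation, not the paper's generalized algorithm.

```python
import numpy as np

def plsa_em(ndw, T, iters=50, seed=0):
    """Plain PLSA via EM on a document-word count matrix ndw (D x W).
    Returns phi (T x W) = p(w|t) and theta (D x T) = p(t|d)."""
    rng = np.random.default_rng(seed)
    D, W = ndw.shape
    phi = rng.random((T, W));   phi /= phi.sum(axis=1, keepdims=True)
    theta = rng.random((D, T)); theta /= theta.sum(axis=1, keepdims=True)
    for _ in range(iters):
        # E-step: p(t|d,w) proportional to theta[d,t] * phi[t,w]
        p = theta[:, :, None] * phi[None, :, :]          # D x T x W
        p /= p.sum(axis=1, keepdims=True) + 1e-12
        # M-step: normalize expected counts; regularizers / sparsing
        # would modify these counts before normalization
        n_tw = (ndw[:, None, :] * p).sum(axis=0)         # T x W
        n_dt = (ndw[:, None, :] * p).sum(axis=2)         # D x T
        phi = n_tw / n_tw.sum(axis=1, keepdims=True)
        theta = n_dt / n_dt.sum(axis=1, keepdims=True)
    return phi, theta

# Usage: two documents with disjoint vocabularies, two topics
ndw = np.array([[5, 0], [0, 5]], dtype=float)
phi, theta = plsa_em(ndw, T=2, iters=30)
```

With disjoint documents, each topic concentrates on one word and each document on one topic, so both factor matrices stay properly normalized.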