Additive Regularization for Hierarchical Multimodal Topic Modeling
Probabilistic topic models uncover the latent semantics of text collections and represent each document by a multinomial distribution over topics. Hierarchical models divide topics into subtopics recursively, thus simplifying information retrieval, browsing, and understanding of large multidisciplinary collections. Most existing approaches to hierarchy learning rely on Bayesian inference, which makes it difficult to incorporate topical hierarchies into other types of topic models. The authors use a non-Bayesian multicriteria approach called Additive Regularization of Topic Models (ARTM), which makes it possible to combine any topic models formalized via log-likelihood maximization with additive regularization criteria. In this work, such a formalization is proposed for topical hierarchies. Hence, the hierarchical ARTM (hARTM) can be easily adapted to a wide class of text mining problems, e.g., learning topical hierarchies from multimodal and multilingual heterogeneous data of scientific digital libraries or social media. The authors focus on topical hierarchies that allow a topic to have several parent topics, which is important for multidisciplinary collections of scientific papers. The regularization approach allows one to control the sparsity of the parent–child relation and automatically determine the number of subtopics for each topic. Before learning the hierarchy, it is necessary to fix the number of topics for each layer. Additive regularization does not complicate the learning algorithm, so the approach scales well to large text collections.
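For orientation, the log-likelihood maximization with additive regularization criteria that ARTM refers to is usually written as below. The notation is the standard one from the ARTM literature and is an assumption here, since the abstract itself gives no formulas: n_{dw} is the count of word w in document d, \phi_{wt} = p(w|t), \theta_{td} = p(t|d), R_i are the regularizers, and \tau_i are their non-negative weights. The specific regularizers used for hierarchies are not reproduced.

```latex
\max_{\Phi,\Theta}\;
  \sum_{d \in D} \sum_{w \in d} n_{dw} \ln \sum_{t \in T} \phi_{wt}\,\theta_{td}
  \;+\; \sum_{i=1}^{k} \tau_i R_i(\Phi,\Theta),
\qquad
\phi_{wt} \ge 0,\;\; \sum_{w \in W} \phi_{wt} = 1,
\qquad
\theta_{td} \ge 0,\;\; \sum_{t \in T} \theta_{td} = 1 .
```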
The aim of this article is to analyze the discursive background for the characters of teachers in the Soviet school story of the postwar period. The 1.8-million-word corpus for the study was compiled from novels about school and schooling by 37 authors, written from the 1940s to the 1980s. The contents of the episodes in which the keywords (headmaster, deputy headmaster, teacher, female teacher) were mentioned were analyzed automatically with the help of probabilistic topic modeling (LDA). Topics significantly more or less common in these episodes than in the whole corpus were used to characterize the discursive context for the keywords. Judging by the thematic profile, the term ‘female teacher’ is opposed to all the others. The meaningful contrasts distinguishing the thematic profiles of the terms are: the discourse of upbringing and everyday schooling, Komsomol and Pioneers, emotions and gender.
The paper deals with multilevel regression modelling (MLM) as a method preferable to ordinary least-squares regression in the analysis of comparative data with a hierarchical structure. We present substantive reasons (contextual sources of heterogeneity, causal heterogeneity, and generalisability of results) and statistical reasons (obtaining more precise and reliable estimates) for multilevel modelling. We also provide an overview of MLM implementation in several statistical packages. Using the cross-national World Values Survey (WVS) data, we outline a step-by-step procedure for building and fitting a two-level linear regression model of generalized trust on educational attainment levels (the “null” model, the fixed-intercept model, the random-intercept model, the random-intercept random-slope model, the model with a country-level predictor, and the cross-level interaction model). Then we describe and compare existing goodness-of-fit measures for MLM (AIC, BIC, maximum likelihood functions, and pseudo-R²). We also demonstrate robustness check techniques for multilevel models (visualization, Cook’s distance, and DFBETAs). In the final section, we review alternative approaches to multilevel modelling when dealing with hierarchical data (cluster-robust standard errors, generalized estimating equations, country fixed effects, country means, and aggregation) as currently practiced in comparative cross-national social science research. The replicable R code is attached.
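Two of the goodness-of-fit measures named above follow standard formulas, AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L. The following is a minimal Python sketch of that arithmetic, not the R code attached to the paper. The log-likelihoods, parameter counts, and sample size are hypothetical values chosen only for illustration, and using the number of individual observations as n in BIC is an assumption (multilevel software differs on this point).

```python
import math


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Bayesian information criterion: k ln n - 2 ln L (lower is better)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Hypothetical fits: a random-intercept model and a richer
# random-intercept random-slope model on the same 1000 observations.
candidates = {
    "random_intercept": {"log_likelihood": -1200.0, "n_params": 4},
    "random_slope": {"log_likelihood": -1195.0, "n_params": 6},
}
N_OBS = 1000

for name, fit in candidates.items():
    a = aic(fit["log_likelihood"], fit["n_params"])
    b = bic(fit["log_likelihood"], fit["n_params"], N_OBS)
    print(f"{name}: AIC={a:.1f} BIC={b:.1f}")
```

With these made-up numbers the two criteria disagree, because BIC penalizes the two extra parameters more heavily than AIC does; this is one reason for comparing several measures rather than relying on one.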
Topic modeling is a widely used approach for clustering text documents; however, it has a set of parameters that must be chosen by the user, for example, the number of topics. In this paper, we propose a novel approach for fast approximation of the optimal number of topics that corresponds well to human judgment. Our method combines renormalization theory with the Rényi entropy approach. The main advantage of this method is computational speed, which is crucial when dealing with big data. We apply our method to the Latent Dirichlet Allocation model with a Gibbs sampling procedure and test our approach on two datasets in different languages. Numerical results and a comparison of computational speed demonstrate a significant gain in time with respect to standard grid search methods.
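As background for the entropy part of the method, the Rényi entropy of order q of a discrete distribution p is H_q(p) = ln(sum_i p_i^q) / (1 - q), with the Shannon entropy as the limit q → 1. The sketch below, in Python, implements only this textbook definition. It is not the authors' procedure: how the distribution is built from a topic model and how renormalization enters are not described in the abstract and are not reproduced here.

```python
import math
from typing import Sequence


def renyi_entropy(probs: Sequence[float], q: float) -> float:
    """Renyi entropy of order q (natural log) of a discrete distribution.

    Falls back to the Shannon entropy when q is (numerically) equal to 1.
    """
    if q < 0:
        raise ValueError("order q must be non-negative")
    if any(p < 0 for p in probs) or abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probs must be non-negative and sum to 1")

    positive = [p for p in probs if p > 0]  # zero-probability outcomes contribute nothing
    if abs(q - 1.0) < 1e-12:
        return -sum(p * math.log(p) for p in positive)
    return math.log(sum(p ** q for p in positive)) / (1.0 - q)


# A uniform distribution has entropy ln(n) for every order q,
# while a peaked distribution has lower entropy.
uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.85, 0.05, 0.05, 0.05]
print(renyi_entropy(uniform, 2.0), renyi_entropy(peaked, 2.0))
```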
Probabilistic topic modeling of text collections is a powerful tool for statistical text analysis. In this tutorial we introduce a novel non-Bayesian approach, called Additive Regularization of Topic Models. ARTM is free of redundant probabilistic assumptions and provides a simple inference for many combined and multi-objective topic models.
Non-financial aspects of the functioning of enterprises that influence management decisions are attracting more and more attention in theoretical studies. One of the most common concepts in this area is the integrated theory of financial architecture. Among the extensive set of characteristics of financial architecture, the ownership structure and the composition of the board of directors attract the greatest research interest. It has been found that diversity among directors, including gender diversity, provides a multifaceted view of the company's management and development, improves its reputation, and increases investor interest in it.
The main objective of this study is to analyze the non-financial aspects of the economic efficiency of enterprises, taking into account not only time effects and individual firm effects but also unobserved industry and geographical characteristics. The sample is a panel of Western European companies collected over the period from 2007 to 2012 on the basis of the Bloomberg and Amadeus databases. The multilevel structure of the models makes it possible to separate the effects of the tested indicators (the proportion and number of women on the board of directors, as well as the share held by the largest shareholder and the total stake of the three largest shareholders) from the influence of unobserved variables that cause heterogeneity across companies, countries, sectors, and time periods. Combining the estimated models, a quadratic specification of the indicators of board gender diversity, and the products of these indicators with company performance indicators, with heterogeneous coefficients on these indicators, resolves the contradiction between the theoretical hypotheses of the study and the results of the preliminary data analysis. It was found that increasing the number (or percentage) of women on the board of directors contributes to the efficiency of the company only up to a certain limit, after which efficiency drops. This effect was observed in the majority of the estimated models, including versions that control for possible endogeneity by using lags of the tested indicators instead of their current values. In the analysis of how the marginal effect of the number of women on the board of directors on company performance depends on firm-level indicators, a number of models showed decreasing returns with respect to total assets and financial leverage and increasing returns with respect to expenditure on research and innovation.
Conflicting results were found for company size: the estimates obtained by OLS show constant returns, the estimates of hierarchical models that take country heterogeneity into account show increasing returns, and the estimates of models that take sectoral heterogeneity into account show diminishing returns of the number of women on the board of directors to the strategic performance of companies.
In the vast majority of the estimated models, no effect of ownership concentration on the strategic performance of companies was found.
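The "up to a certain limit" result reported in this study is what a quadratic specification with a negative squared term produces. As a worked illustration in Python, assume a simplified single-predictor form, performance = b0 + b1·x + b2·x², where x is the share of women on the board; the maximum is then at x* = -b1 / (2·b2). The functional form is a simplification of the multilevel models described, and the coefficients below are hypothetical values, not estimates from the paper.

```python
def predicted_performance(b0: float, b1: float, b2: float, x: float) -> float:
    """Value of the quadratic specification b0 + b1*x + b2*x^2."""
    return b0 + b1 * x + b2 * x * x


def turning_point(b1: float, b2: float) -> float:
    """Share x* at which the quadratic reaches its maximum.

    Requires b2 < 0; otherwise the curve has no interior maximum
    (no inverted-U shape).
    """
    if b2 >= 0:
        raise ValueError("inverted-U shape requires a negative squared-term coefficient")
    return -b1 / (2 * b2)


# Hypothetical coefficients, for illustration only.
B0, B1, B2 = 1.0, 0.6, -1.0

x_star = turning_point(B1, B2)
print(f"performance peaks at a share of {x_star:.2f}")
for share in (0.1, x_star, 0.5):
    print(share, predicted_performance(B0, B1, B2, share))
```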
A model for organizing cargo transportation between two node stations connected by a railway line that contains a certain number of intermediate stations is considered. Cargo moves in one direction. Such a situation may occur, for example, if one of the node stations is located in a region that produces raw material for a manufacturing industry located in another region, where the other node station is situated. Freight traffic is organized by means of a number of technologies. These technologies determine the rules for taking on cargo at the initial node station, the rules of interaction between neighboring stations, and the rule for distributing cargo at the final node station. The cargo transportation process follows a prescribed control rule. For such a model, one must determine the possible modes of cargo transportation and describe their properties. The model is described by a finite-dimensional system of differential equations with nonlocal linear restrictions. The class of solutions satisfying the nonlocal linear restrictions is extremely narrow. This creates the need for a “correct” extension of the solutions of the system of differential equations to a class of quasi-solutions whose distinctive feature is discontinuities (jumps) at a countable number of points. Using the fourth-order Runge–Kutta method, we were able to construct these quasi-solutions numerically and determine their rate of growth. Technically, the main difficulty lay in obtaining quasi-solutions that satisfy the nonlocal linear restrictions. Furthermore, we investigated the dependence of the quasi-solutions and, in particular, of the sizes of the jumps on a number of model parameters characterizing the control rule, the technologies for cargo transportation, and the intensity of cargo supply at a node station.
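The fourth-order Runge–Kutta method mentioned above can be sketched in Python for a generic finite-dimensional system y' = f(t, y). This is only the classical integration step under stated assumptions. It does not implement the nonlocal linear restrictions or the jumps of the quasi-solutions, and the test equation y' = -y is a hypothetical stand-in, not the transportation model.

```python
import math
from typing import Callable, List

Vector = List[float]
RightHandSide = Callable[[float, Vector], Vector]


def rk4_step(f: RightHandSide, t: float, y: Vector, h: float) -> Vector:
    """Advance y' = f(t, y) by one classical fourth-order Runge-Kutta step."""

    def shifted(base: Vector, slope: Vector, scale: float) -> Vector:
        return [b + scale * s for b, s in zip(base, slope)]

    k1 = f(t, y)
    k2 = f(t + h / 2, shifted(y, k1, h / 2))
    k3 = f(t + h / 2, shifted(y, k2, h / 2))
    k4 = f(t + h, shifted(y, k3, h))
    return [
        yi + h / 6 * (a + 2 * b + 2 * c + d)
        for yi, a, b, c, d in zip(y, k1, k2, k3, k4)
    ]


def integrate(f: RightHandSide, t0: float, y0: Vector, h: float, n_steps: int) -> Vector:
    """Apply n_steps RK4 steps of size h starting from (t0, y0)."""
    y = list(y0)
    for i in range(n_steps):
        y = rk4_step(f, t0 + i * h, y, h)
    return y


def decay(t: float, y: Vector) -> Vector:
    """Stand-in test system y' = -y with exact solution exp(-t)."""
    return [-y[0]]


approx = integrate(decay, 0.0, [1.0], 0.1, 10)[0]
print(approx, math.exp(-1.0))
```

On the test equation the numerical value at t = 1 agrees with exp(-1) to well within 1e-5 for step size 0.1, which is the expected behaviour of a fourth-order scheme on a smooth problem; solutions with jumps would need the integration to be restarted at each discontinuity.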