Regularization of probabilistic topic models for improving interpretability and determining the number of topics
Probabilistic topic models discover a low-dimensional interpretable representation of text corpora by estimating a multinomial distribution over topics for each document and a multinomial distribution over terms for each topic. A unified family of expectation-maximization (EM)-like algorithms is considered, with smoothing, sampling, sparsing, and robustness heuristics that can be used in any combination. The known models PLSA (probabilistic latent semantic analysis), LDA (latent Dirichlet allocation), and SWB (special words with background), as well as new ones, can be considered as special cases of the presented broad family of models. A new simple robust algorithm is proposed that is suitable for sparse models and does not require estimating and storing a large matrix of noise parameters. The authors experimentally find optimal combinations of heuristics with sparsing strategies and discover that a sparse robust model without Dirichlet smoothing performs very well and gives more than 99% zeros in the multinomial distributions without degrading perplexity.
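As an illustration of how a sparsing heuristic can be attached to an EM-like iteration, the following is a minimal sketch of plain PLSA with a thresholding step after each M-step. It is not the authors' algorithm: the function names, the `sparse_eps` threshold, and the rule "zero out small probabilities and renormalize" are assumptions made for this example, and the robustness (noise and background) components are omitted.

```python
import random


def _normalize(values):
    """Scale non-negative values to sum to 1; fall back to uniform if all are zero."""
    total = sum(values)
    if total <= 0.0:
        return [1.0 / len(values)] * len(values)
    return [v / total for v in values]


def _sparsify(dist, eps):
    """Hypothetical sparsing rule: zero out probabilities below eps, then renormalize."""
    if eps <= 0.0:
        return dist
    kept = [p if p >= eps else 0.0 for p in dist]
    if sum(kept) == 0.0:
        return dist  # threshold too aggressive, keep the distribution unchanged
    return _normalize(kept)


def plsa_em(docs, num_topics, vocab_size, iters=50, sparse_eps=0.0, seed=0):
    """Fit PLSA by EM. docs is a list of documents, each a list of (word_id, count)."""
    rng = random.Random(seed)
    # phi[t][w] approximates p(w | t); theta[d][t] approximates p(t | d).
    phi = [_normalize([rng.random() + 0.1 for _ in range(vocab_size)])
           for _ in range(num_topics)]
    theta = [_normalize([rng.random() + 0.1 for _ in range(num_topics)])
             for _ in docs]
    for _ in range(iters):
        n_wt = [[0.0] * vocab_size for _ in range(num_topics)]
        n_td = [[0.0] * num_topics for _ in docs]
        for d, doc in enumerate(docs):
            for w, n_dw in doc:
                # E-step: p(t | d, w) is proportional to phi[t][w] * theta[d][t].
                joint = [phi[t][w] * theta[d][t] for t in range(num_topics)]
                z = sum(joint)
                if z == 0.0:
                    continue  # no topic explains this word after sparsing
                for t in range(num_topics):
                    expected = n_dw * joint[t] / z
                    n_wt[t][w] += expected
                    n_td[d][t] += expected
        # M-step: normalize expected counts, then apply the sparsing heuristic.
        phi = [_sparsify(_normalize(row), sparse_eps) for row in n_wt]
        theta = [_sparsify(_normalize(row), sparse_eps) for row in n_td]
    return phi, theta
```

Calling `plsa_em(docs, num_topics=2, vocab_size=6, sparse_eps=0.05)` returns row-stochastic `phi` and `theta` in which every entry is either exactly zero or at least `sparse_eps`. Setting `sparse_eps=0.0` recovers unmodified PLSA, which makes it easy to compare the two variants on the same data.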
Random fields that are homogeneous and isotropic with respect to the horizontal variables are useful for studying geophysical (in particular, meteorological) functions of spatial and temporal variables. The horizontal scale of 30 to 3000 km is important for weather forecasting problems. This scale is determined by the spatial resolution of the observing grid for the Earth's atmosphere and by the power of modern computers available for solving the system of hydrothermodynamic equations, which includes water phase transformations, etc.
The correlation functions (CFs) of the random fields may be used for the following purposes:
1) For the optimal interpolation of meteorological information from the observation points to the points of a regular finite-difference grid, and also to another observation point (for checking some observations against others).
2) For model testing: if a climate model adequately simulates not only the mean fields but also the fields of relative dispersions and the CFs, then the climate model should be considered reliable.
The CFs are estimated from a global quality-checked archive of meteorological sounding observations. A special regularization procedure ensures the strict positive definiteness of the CFs. The areas of the Earth's atmosphere where the isotropy hypothesis is substantially violated were localized by a special algorithm.
Let us consider an algorithm that constructs atmospheric fronts separating so-called homogeneous synoptic atmospheric volumes. We can then evaluate CFs separately for the ensemble of pairs of points that lie in the same volume and for the ensemble of pairs of points that lie in different volumes, and compare the two CFs. The better the algorithm, the larger the difference between them, which gives a quality criterion for such algorithms. This statistical approach makes it possible to optimize the algorithm with respect to a large number of numerical parameters. The optimal algorithm was run operationally at the Hydrometeorological Center of Russia. Similar algorithms for the numerical construction of boundaries between homogeneous volumes from a discrete set of observations can be implemented for various physical media.
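The quality criterion described above can be sketched as follows. This is a simplified illustration rather than the operational procedure: it assumes that each observation is a `(value, volume_label)` pair, that all pairs fall into a single distance bin, and that an ordinary Pearson correlation over the pair ensemble stands in for the CF. The names `pair_correlation` and `front_quality` are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pair_correlation(pairs):
    """Pearson correlation over an ensemble of (a, b) value pairs."""
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    return cov / sqrt(var_a * var_b)


def front_quality(snapshots):
    """Correlation for same-volume pairs minus correlation for cross-volume pairs.

    snapshots is a list of observation times; each snapshot is a list of
    (value, volume_label) tuples, where the labels come from the front
    construction algorithm under test.
    """
    same, different = [], []
    for snapshot in snapshots:
        for (va, la), (vb, lb) in combinations(snapshot, 2):
            target = same if la == lb else different
            # Add both orderings so the estimate does not depend on point order.
            target.append((va, vb))
            target.append((vb, va))
    return pair_correlation(same) - pair_correlation(different)
```

A larger value of `front_quality` means the labels separate the field into more internally coherent volumes, so two candidate labelings of the same observations can be compared by this single number. A fuller version would compute the difference per distance bin.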
Problem in the Modeling on the Basis of Regularization and Distributed Computing in the Everest Environment
A method for estimating mathematical models of physical spatial phenomena is presented. The estimation is based on a series of experimental data. The objective function in the inverse optimization problem of identifying the model parameters includes a regularizing term with unknown weight coefficients for the second derivatives of the spatial function describing the phenomenon. A successive cross-validation procedure is used to choose the values of the weight coefficients. This cross-validation consists in approximating one subset of the experimental data by processing the complementary subset. The more accurate this "cross-approximation", the better the set of weight coefficients. Choosing the direction of possible improvement requires solving a number of auxiliary optimization problems. For this purpose, it is proposed to use a distributed computing environment of optimization services deployed with the Everest toolkit.
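The idea of choosing a regularization weight by cross-approximation can be sketched on a one-dimensional toy problem. The example below is an assumption-laden simplification: it uses a single weight `lam` on squared second differences of a grid function, a fixed two-way split of the data, and a plain search over candidate weights instead of the successive procedure and distributed solvers described above. All function names are hypothetical.

```python
def solve(a, b):
    """Solve a dense linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_smooth(n, obs, lam):
    """Minimize sum (f[i] - y_i)^2 over observed i plus lam * sum (second differences)^2.

    obs maps grid index -> measured value; at least two observed indices are
    needed for the normal equations to have a unique solution.
    """
    a = [[0.0] * n for _ in range(n)]
    b = [0.0] * n
    for i, y in obs.items():
        a[i][i] += 1.0
        b[i] += y
    stencil = (1.0, -2.0, 1.0)  # discrete second derivative
    for k in range(n - 2):
        for p in range(3):
            for q in range(3):
                a[k + p][k + q] += lam * stencil[p] * stencil[q]
    return solve(a, b)


def cross_error(n, obs, lam):
    """Fit on one half of the data, measure the squared error on the other half, and swap."""
    keys = sorted(obs)
    part_a = {i: obs[i] for i in keys[0::2]}
    part_b = {i: obs[i] for i in keys[1::2]}
    err = 0.0
    for train, test in ((part_a, part_b), (part_b, part_a)):
        f = fit_smooth(n, train, lam)
        err += sum((f[i] - y) ** 2 for i, y in test.items())
    return err / len(obs)


def choose_weight(n, obs, candidates):
    """Return the candidate weight with the smallest cross-approximation error."""
    return min(candidates, key=lambda lam: cross_error(n, obs, lam))
```

For example, `choose_weight(21, obs, [0.01, 1.0, 100.0])` evaluates three auxiliary optimization problems per split. Each evaluation is independent of the others, which is the property that makes this kind of search a natural fit for distributed execution.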
Topic modeling has emerged over the last decade as a powerful tool for analyzing large text corpora, including Web-based user-generated texts. Topic stability, however, remains a concern: topic models have a very complex optimization landscape with many local maxima, and even different runs of the same model yield very different topics. Aiming to add stability to topic modeling, we propose an approach based on local density regularization, in which words in a local context window of a given word are more likely to be assigned the same topic as that word. We compare several models with local density regularizers and show how they can improve topic stability while remaining on par with classical models in terms of quality metrics.
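One way to picture the effect of a local context window is to blend each token's topic distribution with the average distribution of its neighbors. The sketch below is only an illustration of that intuition and is not the regularizer proposed in the paper: the function name, the `window` and `weight` parameters, and the linear blending rule are all assumptions.

```python
def smooth_topic_posteriors(posteriors, window=2, weight=0.5):
    """Blend each token's topic posterior with the mean posterior of its neighbors.

    posteriors is a list with one topic distribution per token position in a
    document; window is the number of neighbors considered on each side.
    """
    smoothed = []
    for i, own in enumerate(posteriors):
        lo = max(0, i - window)
        hi = min(len(posteriors), i + window + 1)
        neighbors = [posteriors[j] for j in range(lo, hi) if j != i]
        if not neighbors:
            smoothed.append(list(own))  # single-token document, nothing to blend
            continue
        mixed = []
        for t in range(len(own)):
            context = sum(p[t] for p in neighbors) / len(neighbors)
            mixed.append((1.0 - weight) * own[t] + weight * context)
        total = sum(mixed)
        smoothed.append([m / total for m in mixed])
    return smoothed
```

With `weight=0.0` the posteriors are returned unchanged, and larger weights pull each token toward the dominant topic of its window, which is the behavior the regularizer is meant to encourage.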