Тематические модели: добавление биграмм и учет сходства между униграммами и биграммами

М. А. Нокель; Н. В. Лукашевич

?

Тематические модели: добавление биграмм и учет сходства между униграммами и биграммами

Вычислительные методы и программирование. 2015. Т. 16. № 2. С. 215–234.

Nokel M., Loukachevitch N. V.

The results of experimental study of adding bigrams and taking account of the similarity between them and unigrams are discussed. A novel PLSA-SIM algorithm based on a modification of the original PLSA (Probabilistic Latent Semantic Analysis) algorithm is proposed. The proposed algorithm incorporates bigrams and takes into account the similarity between them and unigram components. Various word association measures are analyzed to integrate top-ranked bigrams into topic models. As target text collections, articles from various Russian electronic banking magazines, English parts of parallel corpora Europarl and JRC-Acquiz, and the English digital archive of research papers in computational linguistics (ACL Anthology) are chosen. The computational experiments show that there exists a subgroup of tested measures that produce top-ranked bigrams in such a way that their inclusion into the PLSA-SIM algorithm significantly improves the quality of topic models for all collections. A novel unsupervised iterative algorithm named PLSA-ITER is also proposed for adding the most relevant bigrams. The computational experiments show a further improvement in the quality of topic models compared to the PLSA algorithm.

Priority areas: IT and mathematics

Language: Russian

Text on another site

Keywords: perplexity Topic Models тематические модели PLSA PLSA bigrams topic coherence биграммы согласованность тем перплексия word association measures ассоциативные меры

Using predefined vector systems to speed up neural network multimillion class classification

Gabdullin N., Androsov I., / Series Computer Science "arxiv.org". 2026.

Label prediction in neural networks (NNs) has O(n) complexity proportional to the number of classes. This holds true for classification using fully connected layers and cosine similarity with some set of class prototypes. In this paper we show that if NN latent space (LS) geometry is known and possesses specific properties, label prediction complexity can ...

Added: April 2, 2026

Iterative Ricci-Foster Curvature Flow with GMM-Based Edge Pruning: A Novel Approach to Community Detection

Sorokin K., Beketov M., Онучин А. et al., / arxiv.org. Серия cs.SI "Social and Information Networks ". 2025.

Community detection in complex networks is a fundamental problem, open to new approaches in various scientific settings. We introduce a novel community detection method, based on Ricci flow on graphs. Our technique iteratively updates edge weights (their metric lengths) according to their (combinatorial) Foster version of Ricci curvature computed from effective resistance distance between the ...

Added: January 15, 2026

Implementing Transport Coding in OMNeT++ for Message Delay Reduction

Petrovanov I., Sergeev A., / Series Computer Science "arxiv.org". 2025. No. 2512.18332.

Transport coding reduces message delay in packet-switched networks by introducing controlled redundancy at the transport layer: original packets are encoded into coded packets, and the message is reconstructed after the first successful deliveries, effectively shifting latency from the maximum packet delay to the -th order statistic. We present a concise, reproducible discrete-event implementation of transport coding in OMNeT++, including ...

Added: December 24, 2025

Hessian-based lightweight neural network for brain vessel segmentation on a minimal training dataset

Меньшиков И. А., Бернадотт А. К., Elvimov N. S., / Series arXie "Statistical mechanics". 2025.

Accurate segmentation of blood vessels in brain magnetic resonance angiography (MRA) is essential for successful surgical procedures, such as aneurysm repair or bypass surgery. Currently, annotation is primarily performed through manual segmentation or classical methods, such as the Frangi filter, which often lack sufficient accuracy. Neural networks have emerged as powerful tools for medical image ...

Added: December 1, 2025

Determining the boundary of dynamical chaos in the generalized Chirikov map via machine learning

Чернышов Д. П., Satanin A., Shchur L., / Series arXiv "math". 2025.

We investigate the boundary separating regular and chaotic dynamics in the generalized Chirikov map, an extension of the standard map with phase-shifted secondary kicks. Lyapunov maps were computed across the parameter space (K,K(α, τ)) and used to train a convolutional neural network (ResNet18) for binary classification of dynamical regimes. The model reproduces the known critical ...

Added: November 21, 2025

Эффективный алгоритм торговли на фондовом рынке: ретроспективный анализ, основанный на данных по S&P-500.

Rubchinskiy A., Chubarova D., / Series WP7 "Математические методы анализа решений в экономике, бизнесе и политике". 2025. No. WP7/2025/01.

The article examines one of the most famous examples of socio-economic systems, characterized by significant uncertainty – the S&P-500 stock market, where shares of 500 largest US companies are traded. No assumptions are made about the probabilistic characteristics of the stock market. A flexible algorithm for daily trading has been developed, based on both known fixed data ...

Added: November 9, 2025

Diffusion on language model embeddings for protein sequence generation

Meshchaninov V., Strashnov, P., Shevtsov A. et al., / Cornell University. Серия CoRR, arXiv:2403.03726 "Computing Research Repository,". 2025.

Protein design requires a deep understanding of the inherent complexities of the protein universe. While many efforts lean towards conditional generation or focus on specific families of proteins, the foundational task of unconditional generation remains underexplored and undervalued. Here, we explore this pivotal domain, introducing DiMA, a model that leverages continuous diffusion on embeddings derived ...

Added: October 5, 2025

Smoothie: Smoothing Diffusion on Token Embeddings for Text Generation

Shabalin A., Meshchaninov V., Vetrov D., / Series cs.CL, arXiv:2505.18853 "Computation and Language". 2025.

Diffusion models have achieved state-of-the-art performance in generating images, audio, and video, but their adaptation to text remains challenging due to its discrete nature. Prior approaches either apply Gaussian diffusion in continuous latent spaces, which inherits semantic structure but struggles with token decoding, or operate in categorical simplex space, which respect discreteness but disregard semantic ...

Added: October 5, 2025

A Feature Engineering Framework for Computer Vision Based on Topological Data Analysis

Абрамов А. С., Chernyshev V. L., Mikhaylets E. et al., / Series Social Science Research Network "Social Science Research Network". 2025.

Computer vision is one of the most relevant modern research areas with broad practical applications. However, traditional solutions based on deep learning have signicant limitations and can be misleading. Topological data analysis, on the other hand, is a modern approach to solving similar problems using mathematically deterministic methods of algebraic topology that reduce the risk ...

Added: September 23, 2025

On the construction of frieze patterns from partitions of convex polygons by nonintersecting diagonals

Kochetkov Y., / Series arXiv.org e-print archive "arXiv.math". 2025. No. 07600.

We demonstrate in an elementary way how to construct a frieze pattern of width m-3 from a partition of a convex m-gon by not intersecting diagonals. ...

Added: September 17, 2025

On one property of Catalan numbers

Kochetkov Y., / Series arXiv.org e-print archive "arXiv.math". 2025. No. 20584.

We give a new proof of the following statement: the Catalan number C_n is divisible by n+2, if n is odd and n<> 3k+1. ...

Added: September 9, 2025

Low Sets and Closure Properties of Counting Function Classes

Ivanashev Y., / Series Computer Science "arxiv.org". 2025.

Added: July 29, 2025

ComputAgeBench: Epigenetic Aging Clocks Benchmark

Dudkovskaia Anastasiia, / Series 005140 "Biorxiv". 2025.

The success of clinical trials of longevity drugs relies heavily on identifying integrative health and aging biomarkers, such as biological age. Epigenetic aging clocks predict the biological age of an individual using their DNA methylation profiles, commonly retrieved from blood samples. However, there is no standardized methodology to validate and compare epigenetic clock models as ...

Added: July 18, 2025

An archaic reference-free method to jointly infer Neanderthal and Denisovan introgressed segments in modern human genomes

Planche L., Ilina A., Ávila-Arcos M. et al., / Series 005140 "Biorxiv". 2025.

Admixture between populations is a common feature of human history. Admixture events introduce new genetic variation that can fuel evolution. Characterizing the significance of admixture events on the evolution of a population across various species is of great interest to evolutionary geneticists. Local Ancestry Inference (LAI) methods infer genetic ancestry of an individual at a ...

Added: May 19, 2025

Data-Driven Approach To Patient Flow Management And Resource Utilization In Urban Medical Facilities

Elizaveta S. Prokofyeva, Svetlana V. Maltseva, Fomichev N. et al., , in: 2020 IEEE 22nd Conference on Business Informatics (CBI).: IEEE, 2020. P. 71–77.

Healthcare services are tightly connected with complex data analysis techniques to enable optimal resource allocation in medical institutions. This paper proposes a detailed analysis of incoming patient flow to local polyclinic by integrating clustering techniques, process mining and a concept of self-organizing systems. The study takes into account concepts based on models of managing social ...

Added: August 31, 2020

Evaluating Distributional Semantic Models with Russian Noun-Adjective Compositions

Panicheva P., Protopopova E., Bukia G. et al., , in: Analysis of Images, Social Networks and Texts. 5th International Conference, AIST 2016, Yekaterinburg, Russia, April 7-9, 2016, Revised Selected Papers. Communications in Computer and Information ScienceVol. 661.: Switzerland: Springer, 2017. P. 236–247.

In the paper vector-space semantic models based on Word2Vec word embeddings algorithm and a count-based association-oriented algorithm are evaluated and compared by measuring association strength between Russian nouns and adjectives. A dataset of nouns and associated adjectives is used as the test set for pseudodisambiguation task. Models are trained with corpora of Russian fiction. A ...

Added: February 18, 2019

A Method of Accounting Bigrams in Topic Models

Nokel M., Loukachevitch N. V., , in: NAACL HLT 2015 11th Workshop on Multiword Expressions MWE 2014.: NY: Association for Computational Linguistics, 2015. P. 1–9.

The paper describes the results of an empirical study of integrating bigram collocations and similarities between them and unigrams into topic models. First of all, we propose a novel algorithm PLSA-SIM that is a modification of the original algorithm PLSA. It incorporates bigrams and maintains relationships between unigrams and bigrams based on their component structure. ...

Added: March 16, 2016

Topic Models: Accounting Component Structure of Bigrams

Nokel M., Loukachevitch N. V., , in: Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015).: Linköping: Linköping University Electronic Press, 2015. P. 145–152.

The paper describes the results of an empirical study of integrating bigram collocations and similarities between them and unigrams into topic models. First of all, we propose a novel algorithm PLSASIM that is a modification of the original algorithm PLSA. It incorporates bigrams and maintains relationships between unigrams and bigrams based on their component structure. ...

Added: March 16, 2016

Метод учёта структуры биграмм в тематических моделях

Nokel M., Вестник Воронежского государственного университета. Серия: Системный анализ и информационные технологии 2014 № 4 С. 89–97

The paper presents the results of experimental study of integrating word similarity and bigram collocations into topic models. First of all, we analyze a variety of word association measures in order to integrate top-ranked bigrams into topic models. Then we propose a modification of the original algorithm PLSA, which takes into account similar unigrams and ...

Added: March 15, 2016

Topic Models Regularization and Initialization for Regression Problems

Sokolov E., Bogolubsky L., , in: Proceedings of the 2015 Workshop on Topic Models: Post-Processing and Applications.: NY: ACM, 2015. P. 21–27.

We propose a new method of feature extraction for regression problems with text data that transforms the sparse texts to dense features using regularized topic models. We also discuss the problem of topic model initialization, and propose a new approach based on Naive Bayes. This approach is compared to many others, and it achieves a ...

Added: February 24, 2016