Texterra: An Infrastructure for Text Analysis
The paper presents a framework for fast text analytics developed during the Texterra project. Texterra is a technology for multilingual text mining based on novel text processing methods that exploit knowledge extracted from user-generated content. It delivers a fast, scalable solution for text mining without expensive customization. Depending on the use case, Texterra can be utilized as a library, an extensible framework, or a scalable cloud-based service. This paper describes the details of the project, its use cases, and evaluation results for all developed tools.
Texterra utilizes Wikipedia as a primary knowledge source to facilitate text mining in arbitrary documents (news, blogs, etc.). We mine the graph of Wikipedia's links to compute semantic relatedness between all concepts described in Wikipedia. As a result, we build a semantic graph with more than 5 million concepts. This graph is exploited to interpret the meanings and relationships of terms in text documents.
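The abstract does not spell out which relatedness measure is computed over the link graph; a common link-based choice for Wikipedia is the Milne–Witten measure. The sketch below is a minimal illustration under that assumption (the function name and toy numbers are illustrative, not Texterra's actual implementation):

```python
import math

def milne_witten(inlinks_a: set, inlinks_b: set, total_articles: int) -> float:
    """Link-based semantic relatedness between two Wikipedia concepts
    (Milne & Witten, 2008). Inputs are the sets of article IDs that
    link to each concept and the total number of articles in the graph."""
    common = inlinks_a & inlinks_b
    if not common:
        return 0.0
    big = max(len(inlinks_a), len(inlinks_b))
    small = min(len(inlinks_a), len(inlinks_b))
    # Normalized link distance; smaller distance means higher relatedness.
    distance = (math.log(big) - math.log(len(common))) / (
        math.log(total_articles) - math.log(small)
    )
    return max(0.0, 1.0 - distance)

# Toy usage: concepts sharing most of their in-links score as closely related.
print(milne_witten({1, 2, 3, 4}, {2, 3, 4, 5}, total_articles=5_000_000))
```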
Despite its large size, Wikipedia does not contain information about many domain-specific concepts. To increase the applicability of the technology, we developed several automatic knowledge extraction tools. These include systems for knowledge extraction from MediaWiki resources and Linked Data resources, as well as a system for extending the knowledge base with concepts described in arbitrary text documents using original information extraction techniques.
In addition, the use of information from Wikipedia makes it easy to extend Texterra to support new natural languages. The paper presents an evaluation of Texterra applied to different text processing tasks (part-of-speech tagging, word sense disambiguation, keyword extraction, and sentiment analysis) for English and Russian.
This paper is an overview of current issues and tendencies in computational linguistics, based on the materials of the COLING 2012 conference on computational linguistics. Modern approaches to traditional NLP domains such as POS tagging, syntactic parsing, and machine translation are discussed. Highlights of automated information extraction, such as fact extraction and opinion mining, are also in focus. The main tendency of modern technologies in computational linguistics is to incorporate higher levels of linguistic analysis (discourse analysis, cognitive modeling) into the models and to combine machine learning techniques with algorithmic methods grounded in deep expert linguistic knowledge.
The volume contains the abstracts of the 12th International Conference "Intelligent Data Processing: Theory and Applications". The conference is organized by the Russian Academy of Sciences, the Federal Research Center "Informatics and Control" of the Russian Academy of Sciences, and the Scientific and Coordination Center "Digital Methods of Data Mining". The conference has been held biennially since 1989 and is one of the most recognized scientific forums on data mining, machine learning, pattern recognition, image analysis, signal processing, and discrete analysis. The Organizing Committee of IDP-2018 is grateful to Forecsys Co. and CFRS Co. for providing assistance in the preparation and execution of the conference. The conference is funded by RFBR, grant 18-07-20075. The conference website is http://mmro.ru/en/.
Proceedings of the 15th International Conference on Artificial Intelligence: Methodology, Systems, Applications, AIMSA 2012, Varna, Bulgaria, September 12-15, 2012.
Formal Concept Analysis (FCA) is a mathematical technique that has been extensively applied to Boolean data in knowledge discovery, information retrieval, web mining, and other applications. In recent years, research on extending FCA theory to cope with imprecise and incomplete information has made significant progress. In this paper, we give a systematic overview of the more than 120 papers published between 2003 and 2011 on FCA with fuzzy attributes and rough FCA. We applied traditional FCA as a text-mining instrument to 1072 papers mentioning FCA in the abstract. These papers were available as PDF files; using a thesaurus with terms referring to research topics, we transformed them into concept lattices. These lattices were used to analyze and explore the most prominent research topics within the fuzzy-FCA and rough-FCA research communities. FCA turned out to be an ideal meta-technique for representing large volumes of unstructured texts.
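To make the FCA machinery referred to above concrete, here is a minimal, deliberately naive sketch that derives all formal concepts from a toy document-term context; the papers and terms are illustrative, and practical tools use efficient algorithms such as NextClosure rather than brute-force enumeration:

```python
from itertools import combinations

# Toy document-term context: which thesaurus terms occur in which paper.
context = {
    "paper1": {"fuzzy", "lattice"},
    "paper2": {"fuzzy", "rough"},
    "paper3": {"lattice", "rough"},
}
terms = sorted({t for ts in context.values() for t in ts})

def extent(attrs):
    """All papers whose term sets contain every term in attrs."""
    return {d for d, ts in context.items() if attrs <= ts}

def intent(docs):
    """All terms shared by every paper in docs."""
    shared = set(terms)
    for d in docs:
        shared &= context[d]
    return shared

# Naively enumerate formal concepts: all closed (extent, intent) pairs.
concepts = {
    (frozenset(e), frozenset(intent(e)))
    for r in range(len(terms) + 1)
    for combo in combinations(terms, r)
    for e in [extent(set(combo))]
}

for e, i in sorted(concepts, key=lambda c: -len(c[0])):
    print(sorted(e), "<->", sorted(i))
```

Ordering the concepts by extent size, as above, corresponds to reading the concept lattice from its top element downward.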
Concept Relation Discovery and Innovation Enabling Technology (CORDIET) is a toolbox for gaining new knowledge from unstructured text data. At the core of CORDIET is the C-K theory, which captures the essential elements of innovation. The tool uses Formal Concept Analysis (FCA), Emergent Self-Organizing Maps (ESOM), and Hidden Markov Models (HMM) as the main artifacts in the analysis process. The user can define temporal, text mining, and compound attributes. The text mining attributes are used to analyze the unstructured text in documents, while the temporal attributes use the documents' timestamps for analysis. The compound attributes are XML rules based on text mining and temporal attributes. The user can cluster objects with object-cluster rules and can partition the data with segmentation rules. The artifacts are optimized for efficient data analysis; object labels in the FCA lattice and the ESOM map contain a URL on which the user can click to open the selected document.
Formal Concept Analysis (FCA) is an unsupervised clustering technique, and many scientific papers are devoted to applying FCA in Information Retrieval (IR) research. We collected 103 papers published between 2003 and 2009 which mention FCA and information retrieval in the abstract, title, or keywords. Using a prototype of our FCA-based toolset CORDIET, we converted the PDF files containing the papers to plain text, indexed them with Lucene using a thesaurus containing terms related to FCA research, and then created the concept lattice shown in this paper. We visualized, analyzed, and explored the literature with concept lattices and discovered multiple interesting research streams in IR, of which we give an extensive overview. The core contributions of this paper are the innovative application of FCA to the text mining of scientific papers and the survey of FCA-based IR research.
A model is considered for organizing cargo transportation between two node stations connected by a railway line which contains a certain number of intermediate stations. The cargo moves in one direction. Such a situation may occur, for example, if one of the node stations is located in a region that produces raw materials for a manufacturing industry located in the region of the other node station. The organization of freight traffic is performed by means of a number of technologies. These technologies determine the rules for taking on cargo at the initial node station, the rules of interaction between neighboring stations, and the rule of distribution of cargo at the final node station. The process of cargo transportation is governed by a specified control rule. For such a model, one must determine the possible modes of cargo transportation and describe their properties. The model is described by a finite-dimensional system of differential equations with nonlocal linear restrictions. The class of solutions satisfying the nonlocal linear restrictions is extremely narrow, which results in the need for a "correct" extension of the solutions of the system of differential equations to a class of quasi-solutions whose distinctive feature is gaps at a countable number of points. Using the fourth-order Runge–Kutta method, we were able to construct these quasi-solutions numerically and determine their rate of growth. Note that the main technical difficulty consisted in obtaining quasi-solutions satisfying the nonlocal linear restrictions. Furthermore, we investigated the dependence of the quasi-solutions, and in particular the sizes of the gaps (jumps), on a number of parameters of the model characterizing the control rule, the technologies for cargo transportation, and the intensity of cargo supply at the node station.
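The paper's specific system with nonlocal linear restrictions is not reproduced in the abstract; as a reference point, the following is a minimal sketch of the classical fourth-order Runge–Kutta step the authors mention, applied to a generic toy linear system (the matrix and step size are illustrative only):

```python
import numpy as np

def rk4_step(f, t, y, h):
    """One classical fourth-order Runge-Kutta step for y' = f(t, y)."""
    k1 = f(t, y)
    k2 = f(t + h / 2, y + h / 2 * k1)
    k3 = f(t + h / 2, y + h / 2 * k2)
    k4 = f(t + h, y + h * k3)
    return y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

# Toy usage on a linear system y' = A y; the paper's system with
# nonlocal linear restrictions is not reproduced here.
A = np.array([[0.0, 1.0], [-1.0, 0.0]])
f = lambda t, y: A @ y
y, t, h = np.array([1.0, 0.0]), 0.0, 0.01
for _ in range(100):
    y = rk4_step(f, t, y, h)
    t += h
print(t, y)  # approximately (cos 1, -sin 1)
```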
Event logs collected by modern information and technical systems usually contain enough data for automated discovery of process models. A variety of algorithms has been developed for process model discovery, conformance checking, log-to-model alignment, comparison of process models, etc.; nevertheless, quick analysis of ad hoc selected parts of a log still lacks a full-fledged implementation. This paper describes an ROLAP-based method of multidimensional event log storage for process mining. The result of the log analysis is visualized as a directed graph representing the union of all possible event sequences, ranked by their occurrence probability. Our implementation allows the analyst to discover process models for sublogs defined by an ad hoc selection of criteria and occurrence probability values.
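The ROLAP storage layer itself is beyond what the abstract describes; the following is a minimal sketch of only the final visualization step, assuming a directly-follows-style graph whose edges are ranked by transition probability over a toy log (the activity names are illustrative):

```python
from collections import Counter, defaultdict

# Toy event log: one trace (sequence of activities) per case.
log = [
    ["register", "check", "approve"],
    ["register", "check", "reject"],
    ["register", "check", "approve"],
]

# Count directly-follows pairs across all traces.
edges = Counter()
for trace in log:
    for a, b in zip(trace, trace[1:]):
        edges[(a, b)] += 1

# Rank each edge by its occurrence probability among the
# outgoing transitions of the source activity.
out_totals = defaultdict(int)
for (a, _), n in edges.items():
    out_totals[a] += n

for (a, b), n in sorted(edges.items()):
    print(f"{a} -> {b}: p = {n / out_totals[a]:.2f}")
```

Filtering these edges by a probability threshold corresponds to the abstract's idea of restricting the discovered model to sufficiently frequent behaviour.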
The geographic information system (GIS) is based on the first and only Russian Imperial Census of 1897 and the First All-Union Census of the Soviet Union of 1926. The GIS features vector data (shapefiles) of all provinces of the two states. For the 1897 census, there is information about linguistic, religious, and social estate groups. The part based on the 1926 census features nationality. Both shapefiles include information on gender and on rural and urban population. The GIS allows for producing any maps needed for individual studies of the period that require administrative boundaries and demographic information.
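As a sketch of how such shapefiles can be turned into a thematic map, the snippet below assumes the geopandas library; the file name and attribute columns (census_1897_provinces.shp, urban_pop, rural_pop) are hypothetical, since the dataset's actual schema is defined by the GIS itself:

```python
import geopandas as gpd
import matplotlib.pyplot as plt

# Hypothetical file and column names; substitute the dataset's real ones.
provinces = gpd.read_file("census_1897_provinces.shp")

# Derive a demographic indicator from the attribute table and map it.
provinces["urban_share"] = provinces["urban_pop"] / (
    provinces["urban_pop"] + provinces["rural_pop"]
)
provinces.plot(column="urban_share", legend=True)
plt.show()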
Existing approaches suggest that IT strategy should be a reflection of business strategy. In practice, however, organisations often do not follow their business strategy even when it is formally declared. Under these conditions, IT strategy can be viewed not as a plan but as an organisational shared view on the role of information systems. This approach reflects only a top-down perspective of IT strategy, so it can be supplemented by a strategic behaviour pattern (i.e., a more or less standard response to changes, formed as a result of previous experience) to implement a bottom-up approach. Two components that can help establish an effective reaction to new initiatives in IT are proposed here: a model of IT-related decision making, and an efficiency measurement metric to estimate the maturity of business processes and the corresponding IT. The use of the proposed tools is demonstrated in practical cases.