Corpus-Based Text Retrieval and Adaptation for Learning System

N. Karpov

?

Corpus-Based Text Retrieval and Adaptation for Learning System

International Journal of Advances in Computer Science and Its Applications. 2014. Vol. 4. No. 2. P. 38–43.

Karpov N.

The algorithm to adapt lexical complexity in the news article which can be used as materials for learning language presented in the paper. We consider words substitution retrieval according to wordnet-based and corpus-based semantic relatedness. Two corpus-based similarity measures empirically tested: Vector Space Model and Distributional Semantic Model. This language processing algorithm has created as a client-server application. It retrieves appropriate text from Web-resource. Next it performs adaptation procedure.

Priority areas: IT and mathematics

Language: English

Full text

Text on another site

Keywords: natural language processing автоматическая обработка текста

Publication based on the results of:

Адаптация языкового материала НКРЯ для электронного учебника «Русский язык как иностранный» (2013)

ML-based Fast Simulation of FARICH Responses

Shipilov F., Barnyakov A., Ivanov A. et al., / Series Physics "arxiv.org". 2026.

A fast simulation of the detector response is a vital task in high-energy physics (HEP). Traditional Monte-Carlo methods form the backbone of modern particle physics simulation software but are computationally expensive. We present a machine-learning-based approach to fast simulation of the Focusing Aerogel Ring Imaging Cherenkov (FARICH) detector response. Given a particle track and momentum, ...

Added: May 19, 2026

Natural hazard database from Internet publications: text mining with a large language model

Derkacheva A., Sakirkina M., Kraev G. et al., /. 2026.

Comprehensive data on natural hazards and their consequences are crucial for effective for risk assessment, adaptation planning, and emergency response. However, many countries face challenges with fragmented, inconsistent, and inaccessible data, particularly regarding local-scale events. To address this data gap in Russia, we developed an end-to-end processing pipeline that scrapes news from various online sources, ...

Added: April 28, 2026

Algorithmic overlaps as thermodynamic variables: from local to cluster Monte Carlo dynamics in critical phenomena

Pilé I., Deng Y., Shchur L., / Series arXiv "math". 2026. No. 2604.10254.

We investigate the spatial overlap of successive spin configurations in Markov chain Monte Carlo simulations using the local Metropolis algorithm and the Svendsen-Wang and Wolff cluster algorithms. We examine the dynamics of these algorithms for two models in different universality classes: the Ising model and the Potts model with three components. The overlap of two ...

Added: April 20, 2026

Using predefined vector systems to speed up neural network multimillion class classification

Gabdullin N., Androsov I., / Series Computer Science "arxiv.org". 2026.

Label prediction in neural networks (NNs) has O(n) complexity proportional to the number of classes. This holds true for classification using fully connected layers and cosine similarity with some set of class prototypes. In this paper we show that if NN latent space (LS) geometry is known and possesses specific properties, label prediction complexity can ...

Added: April 2, 2026

RuCLEVR: A Russian Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning

Biryukova K., Chelnokova D., Erkenova J. et al., Communications in Computer and Information Science 2024 Vol. 2364 CCIS P. 109 – 121

Added: February 25, 2026

Iterative Ricci-Foster Curvature Flow with GMM-Based Edge Pruning: A Novel Approach to Community Detection

Sorokin K., Beketov M., Онучин А. et al., / arxiv.org. Серия cs.SI "Social and Information Networks ". 2025.

Community detection in complex networks is a fundamental problem, open to new approaches in various scientific settings. We introduce a novel community detection method, based on Ricci flow on graphs. Our technique iteratively updates edge weights (their metric lengths) according to their (combinatorial) Foster version of Ricci curvature computed from effective resistance distance between the ...

Added: January 15, 2026

Implementing Transport Coding in OMNeT++ for Message Delay Reduction

Petrovanov I., Sergeev A., / Series Computer Science "arxiv.org". 2025. No. 2512.18332.

Transport coding reduces message delay in packet-switched networks by introducing controlled redundancy at the transport layer: original packets are encoded into coded packets, and the message is reconstructed after the first successful deliveries, effectively shifting latency from the maximum packet delay to the -th order statistic. We present a concise, reproducible discrete-event implementation of transport coding in OMNeT++, including ...

Added: December 24, 2025

Hessian-based lightweight neural network for brain vessel segmentation on a minimal training dataset

Меньшиков И. А., Бернадотт А. К., Elvimov N. S., / Series arXie "Statistical mechanics". 2025.

Accurate segmentation of blood vessels in brain magnetic resonance angiography (MRA) is essential for successful surgical procedures, such as aneurysm repair or bypass surgery. Currently, annotation is primarily performed through manual segmentation or classical methods, such as the Frangi filter, which often lack sufficient accuracy. Neural networks have emerged as powerful tools for medical image ...

Added: December 1, 2025

Determining the boundary of dynamical chaos in the generalized Chirikov map via machine learning

Chernyshov D., Satanin A., Shchur L., / Series arXiv "math". 2025.

We investigate the boundary separating regular and chaotic dynamics in the generalized Chirikov map, an extension of the standard map with phase-shifted secondary kicks. Lyapunov maps were computed across the parameter space (K,K(α, τ)) and used to train a convolutional neural network (ResNet18) for binary classification of dynamical regimes. The model reproduces the known critical ...

Added: November 21, 2025

Эффективный алгоритм торговли на фондовом рынке: ретроспективный анализ, основанный на данных по S&P-500.

Rubchinskiy A., Chubarova D., / Series WP7 "Математические методы анализа решений в экономике, бизнесе и политике". 2025. No. WP7/2025/01.

The article examines one of the most famous examples of socio-economic systems, characterized by significant uncertainty – the S&P-500 stock market, where shares of 500 largest US companies are traded. No assumptions are made about the probabilistic characteristics of the stock market. A flexible algorithm for daily trading has been developed, based on both known fixed data ...

Added: November 9, 2025

Diffusion on language model embeddings for protein sequence generation

Meshchaninov V., Strashnov, P., Shevtsov A. et al., / Cornell University. Серия CoRR, arXiv:2403.03726 "Computing Research Repository,". 2025.

Protein design requires a deep understanding of the inherent complexities of the protein universe. While many efforts lean towards conditional generation or focus on specific families of proteins, the foundational task of unconditional generation remains underexplored and undervalued. Here, we explore this pivotal domain, introducing DiMA, a model that leverages continuous diffusion on embeddings derived ...

Added: October 5, 2025

Smoothie: Smoothing Diffusion on Token Embeddings for Text Generation

Shabalin A., Meshchaninov V., Vetrov D., / Series cs.CL, arXiv:2505.18853 "Computation and Language". 2025.

Diffusion models have achieved state-of-the-art performance in generating images, audio, and video, but their adaptation to text remains challenging due to its discrete nature. Prior approaches either apply Gaussian diffusion in continuous latent spaces, which inherits semantic structure but struggles with token decoding, or operate in categorical simplex space, which respect discreteness but disregard semantic ...

Added: October 5, 2025

A Feature Engineering Framework for Computer Vision Based on Topological Data Analysis

Абрамов А. С., Chernyshev V. L., Mikhaylets E. et al., / Series Social Science Research Network "Social Science Research Network". 2025.

Computer vision is one of the most relevant modern research areas with broad practical applications. However, traditional solutions based on deep learning have signicant limitations and can be misleading. Topological data analysis, on the other hand, is a modern approach to solving similar problems using mathematically deterministic methods of algebraic topology that reduce the risk ...

Added: September 23, 2025

On the construction of frieze patterns from partitions of convex polygons by nonintersecting diagonals

Kochetkov Y., / Series arXiv.org e-print archive "arXiv.math". 2025. No. 07600.

We demonstrate in an elementary way how to construct a frieze pattern of width m-3 from a partition of a convex m-gon by not intersecting diagonals. ...

Added: September 17, 2025

On one property of Catalan numbers

Kochetkov Y., / Series arXiv.org e-print archive "arXiv.math". 2025. No. 20584.

We give a new proof of the following statement: the Catalan number C_n is divisible by n+2, if n is odd and n<> 3k+1. ...

Added: September 9, 2025

Rewriting the Rules: LLMs Vs. Traditional ML in University Admissions

Chepikov I., Karpov I., , in: 26th International Conference, AIED 2025, Palermo, Italy, July 22–26, 2025, Proceedings, Part I. Artificial Intelligence in Education. Posters and Late Breaking Results, Workshops and Tutorials, Industry and Innovation Tracks, Practitioners, Doctoral Consortium, Blue Sky, and WideAIED.: Springer, 2025. P. 352 – 358.

Modern LLM models such as BERT, ChatGPT, DeepSeek have shown great potential in solving various tasks, including text classification, text generation, analysis and summary of documents. In this paper, we show that these models close to classical ML approaches based on decision trees not only in text processing, but also in processing classical tabular data ...

Added: September 4, 2025

Automatic Morpheme Segmentation for Russian: Can an Algorithm Replace Experts?

Morozov D., Garipov T., Lyashevskaya O. et al., Journal of Language and Education 2024 Vol. 10 No. 4 P. 71–84

Introduction: Numerous algorithms have been proposed for the task of automatic morpheme segmentation of Russian words. Due to the differences in task formulation and datasets utilized, comparing the quality of these algorithms is challenging. It is unclear whether the errors in the models are due to the ineffectiveness of algorithms themselves or to errors and inconsistencies ...

Added: January 7, 2025

Cross-country analysis of science, technology and innovation policies: non-covid-19 related and Covid-19 specific STI policies in OECD countries

Russo M., Pavone P., Meissner D. et al., Quality and Quantity 2025 Vol. 59 No. Suppl 1 P. S343–S367

In OECD countries, Science, Technology and Innovation (STI) policies were seen as key aspects of coping with the Covid-19 pandemic. Now that the pandemic is over, identifying which policy mix portfolios characterised countries in terms of their non-Covid-19 related and Covid-19 specific STI policies fills a knowledge gap on changes in STI policies induced by ...

Added: September 27, 2024

Parameter-Efficient Tuning of Transformer Models for Anglicism Detection and Substitution in Russian

Daniil Lukichev, Kryanina Darya, Anastasia Bystrova et al., , in: Компьютерная лингвистика и интеллектуальные технологии: По материалам ежегодной международной конференции «Диалог». Вып. 22.Вып. 22.: [б.и.], 2023. P. 295–306.

Added: April 25, 2024

Анализ ошибок морфологического анализатора MyStem при работе с записями детской речи

Lelik V., Eremicheva T., Morozova D. et al., В кн.: Когнитивная наука в Москве: новые исследования. Материалы конференции 21–22 июня 2023 г.: М.: «Буки Веди», Московский институт психоанализа, 2023. С. 274–279.

Some of the important conditions of the effectiveness of morphological analyzers are the correct recognition of unfamiliar words and successful morphological disambiguation. In this work, we evaluated the results of automatic processing of children’s spontaneous speech using the morphological analyzer MyStem. We analyzed the longitudinal spontaneous speech recordings of two bilingual children and their parents ...

Added: April 5, 2024

Explainable Document Classification via Pattern Structures

Sergei O. Kuznetsov, Parakal E. G., Lecture Notes in Networks and Systems 2023 Vol. 776 P. 423–434

Inherently explainable Machine Learning (ML) models are able to provide explanations for their predictions by virtue of their construction. The explanations of a ML model are more comprehensible if they are expressed in terms of its input features. Our paper proposes an inherently explainable pipeline for document classification using pattern structures and Abstract Meaning Representation ...

Added: February 5, 2024

Business Process Management Workshops. BPM 2023 International Workshops, Utrecht, The Netherlands, September 11–15, 2023, Revised Selected Papers

Switzerland: Springer, 2024.

This book constitutes revised papers from the International Workshops held at the 21st International Conference on Business Process Management, BPM 2023, in Utrecht, The Netherlands, during September 2023. Papers from the following workshops are included: • 7th International Workshop on Artificial Intelligence for Business Process Management (AI4BPM 2023) • 7th International Workshop on Business Processes Meet Internet-of-Things (BP-Meet-IoT ...

Added: January 17, 2024

Проект Chekhov Digital: задачи и проблемы реализации семантической разметки текстов (на примере рассказа А. П. Чехова «Смерть чиновника»)

Северина Е. М., Ларионова М. Ч., Litera 2023 № 10 С. 211–222

The article considers a model of preparation of machine-readable (semantic) markup of texts for the Chekhov Digital project on the example of philological interpretation of individual significant elements of A. P. Chekhov's story "Death of an Official" and presentation of this information explicitly based on the standards of digital publication Text Encoding Initiative (TEI/XML). Based ...

Added: January 12, 2024

РАЗРАБОТКА СИСТЕМЫ ГЕНЕРАЦИИ ПОВСЕДНЕВНЫХ ДИАЛОГОВ НА РУССКОМ ЯЗЫКЕ: ПИЛОТНОЕ ИССЛЕДОВАНИЕ

Кругликова В. Г., В кн.: Анализ речи: теоретические и прикладные аспекты: сборник научных статей.: [б.и.], 2023.

The article presents a comparative analysis of various language models used to generate texts and evaluates their effectiveness for the task of generating conversational speech. There are such models as GPT-3, BERT, LSTM involved in the comparative analysis. This study is part of a project of developing a system for generating dialogues in Russian. The ...

Added: December 10, 2023