Refining a Taxonomy by Using Annotated Suffix Trees and Wikipedia Resources

E. Artemova; B. Mirkin

Annals of Data Science. 2015. Vol. 2. No. 1. P. 61–82.

A step-by-step approach to taxonomy construction is presented. In the first step, the upper-layer frame of the taxonomy is built manually from educational materials. In the subsequent steps, the frame is refined at a chosen topic using the Wikipedia category tree and articles, both cleaned of noise. The main tool is a naturally defined string-to-text relevance score based on annotated suffix trees. The relevance score is used in several tasks, including: (1) cleaning the Wikipedia category tree or page set of noise; (2) allocating Wikipedia categories to taxonomy topics; (3) deciding whether an allocated category should be included as a child of the taxonomy topic. The resulting taxonomy fragment consists of three parts: the manually set upper-layer topic, the adopted part of the Wikipedia category tree, and Wikipedia articles as leaves. Every leaf is assigned a set of so-called descriptors, which are phrases explaining aspects of the leaf topic. The method is illustrated by its application to two domains of mathematics: (a) “Probability theory and mathematical statistics” and (b) “Numerical mathematics” (both in Russian).
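
To make the idea of a string-to-text relevance score more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a naive character-level suffix trie in which every node is annotated with the number of times its substring was inserted, and it assumes the score of a string is the average, over its suffixes, of the mean conditional frequency (node count divided by parent count) along the longest match in the tree. The names `Node`, `build_ast`, `suffix_score`, and `relevance` are hypothetical, and splitting the text into short fragments before building the tree is left to the caller.

```python
class Node:
    """Trie node annotated with how many inserted suffixes pass through it."""

    def __init__(self):
        self.count = 0
        self.children = {}


def build_ast(fragments):
    """Build a naive annotated suffix trie from short text fragments.

    Assumption: every suffix of every fragment is inserted character by
    character; the root count equals the total number of inserted suffixes.
    """
    root = Node()
    for fragment in fragments:
        for start in range(len(fragment)):
            node = root
            node.count += 1
            for ch in fragment[start:]:
                node = node.children.setdefault(ch, Node())
                node.count += 1
    return root


def suffix_score(root, suffix):
    """Mean conditional frequency along the longest match of one suffix."""
    node, total, depth = root, 0.0, 0
    for ch in suffix:
        child = node.children.get(ch)
        if child is None:
            break
        total += child.count / node.count
        node = child
        depth += 1
    return total / depth if depth else 0.0


def relevance(root, string):
    """Average suffix score over all suffixes of the string (0.0 if empty)."""
    if not string:
        return 0.0
    scores = [suffix_score(root, string[i:]) for i in range(len(string))]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical toy text, already split into short fragments.
    tree = build_ast(["probability", "theory", "statistics"])
    for phrase in ("probability", "statistic", "geometry"):
        print(phrase, round(relevance(tree, phrase), 3))
```

Under these assumptions the score lies between 0 and 1, and a phrase sharing long substrings with the text scores higher than one that shares none. A production version would likely use a compressed suffix tree rather than a character trie to limit memory use.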

Research target: Computer Science

Priority areas: IT and mathematics

Language: English

Keywords: taxonomy; annotated suffix tree; taxonomy refinement; phrase-to-text relevance; Wikipedia

Publication based on the results of:

Data Analysis and Decision Making in Socio-Economic and Political Systems (2015)
