Human-centered text mining: A new software system

Kuznetsov S.; Poelman J.; Elzinga P.; Neznanov A.; Dedene G.; Viaene S.

Lecture Notes in Computer Science. 2012. Vol. 7377 LNAI. P. 258–272.

Kuznetsov S., Poelman J., Elzinga P., Neznanov A., Dedene G., Viaene S.

In this paper we introduce a novel human-centered data mining software system which was designed to gain intelligence from unstructured textual data. The architecture takes its roots in several case studies which were a collaboration between the Amsterdam-Amstelland Police, GasthuisZusters Antwerpen (GZA) hospitals and KU Leuven. It is currently being implemented by bachelor and master students of Moscow Higher School of Economics. At the core of the system are concept lattices which can be used to interactively explore the data. They are combined with several other complementary statistical data analysis techniques such as Emergent Self Organizing Maps and Hidden Markov Models.

Priority areas: IT and mathematics

Language: English

Keywords: Formal Concept Analysis; text mining; concept lattices; applications; software system

Is Canfield Right? On the Asymptotic Coefficients for the Maximum Antichain of Partitions and Related Counting Inequalities

Ignatov D. I., in: 11th International Conference, AIST 2023, Yerevan, Armenia, September 28–30, 2023, Revised Selected Papers. Analysis of Images, Social Networks and Texts. Lecture Notes in Computer Science (LNCS, volume 14486). Cham: Springer, 2024. P. 349–361.

This paper goes back to the asymptotic solutions of Rota’s problem on the size of the maximum antichain in the set partition lattice by Canfield, Harper and others. Knowledge of the asymptotic coefficients could pave the way to asymptotic solutions of such problems as (maximal) antichain counting in partition lattices. In addition to our ...

Added: January 23, 2026

Cooperative games with fuzzy characteristic functions on concept lattices

Kemgne M. W., Njionou B. B., Ignatov D. I. et al., International Journal of Approximate Reasoning 2025 Vol. 186 Article 109527

This paper introduces cooperative games with transferable utilities and fuzzy characteristic functions on concept lattices. While previous works have independently addressed games with fuzzy payoffs and games restricted to structured coalition systems such as lattices, our approach combines both perspectives. We consider cooperative settings where coalition formation is constrained by a concept lattice structure, and ...

Added: January 23, 2026

Iterative Ricci-Foster Curvature Flow with GMM-Based Edge Pruning: A Novel Approach to Community Detection

Sorokin K., Beketov M., Onuchin A. et al., / arxiv.org. Series cs.SI "Social and Information Networks". 2025.

Community detection in complex networks is a fundamental problem, open to new approaches in various scientific settings. We introduce a novel community detection method, based on Ricci flow on graphs. Our technique iteratively updates edge weights (their metric lengths) according to their (combinatorial) Foster version of Ricci curvature computed from effective resistance distance between the ...

Added: January 15, 2026

On syntactic concept lattice models for the Lambek calculus and infinitary action logic

Stepan L. Kuznetsov, Journal of Logic and Computation 2026 Vol. 36 No. 1 Article exaf078

The linguistic applications of the Lambek calculus suggest its semantics over algebras of formal languages. A straightforward approach to construct such semantics indeed yields a brilliant completeness theorem (Pentus 1995, Ann. Pure Appl. Logic, 75, 179–213). However, extending the calculus with extra operations ruins completeness. In order to mitigate this issue, Wurm (2017, J. Logic Lang. Inf., ...

Added: January 14, 2026

Implementing Transport Coding in OMNeT++ for Message Delay Reduction

Petrovanov I., Sergeev A., / Series Computer Science "arxiv.org". 2025. No. 2512.18332.

Transport coding reduces message delay in packet-switched networks by introducing controlled redundancy at the transport layer: k original packets are encoded into n coded packets, and the message is reconstructed after the first k successful deliveries, effectively shifting latency from the maximum packet delay to the k-th order statistic. We present a concise, reproducible discrete-event implementation of transport coding in OMNeT++, including ...

Added: December 24, 2025

Hessian-based lightweight neural network for brain vessel segmentation on a minimal training dataset

Menshikov I. A., Bernadotte A. K., Elvimov N. S., / Series arXiv "Statistical mechanics". 2025.

Accurate segmentation of blood vessels in brain magnetic resonance angiography (MRA) is essential for successful surgical procedures, such as aneurysm repair or bypass surgery. Currently, annotation is primarily performed through manual segmentation or classical methods, such as the Frangi filter, which often lack sufficient accuracy. Neural networks have emerged as powerful tools for medical image ...

Added: December 1, 2025

Determining the boundary of dynamical chaos in the generalized Chirikov map via machine learning

Chernyshov D. P., Satanin A., Shchur L., / Series arXiv "math". 2025.

We investigate the boundary separating regular and chaotic dynamics in the generalized Chirikov map, an extension of the standard map with phase-shifted secondary kicks. Lyapunov maps were computed across the parameter space (K,K(α, τ)) and used to train a convolutional neural network (ResNet18) for binary classification of dynamical regimes. The model reproduces the known critical ...

Added: November 21, 2025

An efficient stock market trading algorithm: a retrospective analysis based on S&P-500 data

Rubchinskiy A., Chubarova D., / Series WP7 "Mathematical Methods of Decision Analysis in Economics, Business and Politics". 2025. No. WP7/2025/01.

The article examines one of the most famous examples of socio-economic systems characterized by significant uncertainty: the S&P-500 stock market, where shares of the 500 largest US companies are traded. No assumptions are made about the probabilistic characteristics of the stock market. A flexible algorithm for daily trading has been developed, based on both known fixed data ...

Added: November 9, 2025

Diffusion on language model embeddings for protein sequence generation

Meshchaninov V., Strashnov P., Shevtsov A. et al., / Cornell University. Series CoRR, arXiv:2403.03726 "Computing Research Repository". 2025.

Protein design requires a deep understanding of the inherent complexities of the protein universe. While many efforts lean towards conditional generation or focus on specific families of proteins, the foundational task of unconditional generation remains underexplored and undervalued. Here, we explore this pivotal domain, introducing DiMA, a model that leverages continuous diffusion on embeddings derived ...

Added: October 5, 2025

Smoothie: Smoothing Diffusion on Token Embeddings for Text Generation

Shabalin A., Meshchaninov V., Vetrov D., / Series cs.CL, arXiv:2505.18853 "Computation and Language". 2025.

Diffusion models have achieved state-of-the-art performance in generating images, audio, and video, but their adaptation to text remains challenging due to its discrete nature. Prior approaches either apply Gaussian diffusion in continuous latent spaces, which inherits semantic structure but struggles with token decoding, or operate in categorical simplex space, which respects discreteness but disregards semantic ...

Added: October 5, 2025

A Feature Engineering Framework for Computer Vision Based on Topological Data Analysis

Abramov A. S., Chernyshev V. L., Mikhaylets E. et al., / Series Social Science Research Network "Social Science Research Network". 2025.

Computer vision is one of the most relevant modern research areas with broad practical applications. However, traditional solutions based on deep learning have significant limitations and can be misleading. Topological data analysis, on the other hand, is a modern approach to solving similar problems using mathematically deterministic methods of algebraic topology that reduce the risk ...

Added: September 23, 2025

On the construction of frieze patterns from partitions of convex polygons by nonintersecting diagonals

Kochetkov Y., / Series arXiv.org e-print archive "arXiv.math". 2025. No. 07600.

We demonstrate in an elementary way how to construct a frieze pattern of width m-3 from a partition of a convex m-gon by nonintersecting diagonals. ...

Added: September 17, 2025

On one property of Catalan numbers

Kochetkov Y., / Series arXiv.org e-print archive "arXiv.math". 2025. No. 20584.

We give a new proof of the following statement: the Catalan number C_n is divisible by n+2 if n is odd and n ≠ 3k+1. ...

Added: September 9, 2025

TabGraphs: A Benchmark and Strong Baselines for Learning on Graphs with Tabular Node Features

Bazhenov G., Platonov O., Prokhorenkova L., / Series arXiv:2409.14500 "arXiv:2409.14500 [cs.LG]". 2025.

Tabular machine learning is an important field for industry and science. In this field, table rows are typically treated as independent data samples, but additional information about the relations between these samples is sometimes available and can be used to improve predictive performance. Such information can be naturally modeled with a graph, hence tabular ...

Added: August 14, 2025

2022 IEEE Conference on Control Technology and Applications (CCTA) (Italy, Trieste, August 22-25, 2022)

IEEE, 2022.

Dear contributors, authors and participants of IEEE CCTA 2022 in Trieste, As Program Chair I would like to thank you for your contributions to this important conference concerning control applications in CSS IEEE. Although we had a very uncertain situation with COVID-19, we had more than 360 papers submitted. Very early, we decided to hold the ...

Added: August 9, 2025

Low Sets and Closure Properties of Counting Function Classes

Ivanashev Y., / Series Computer Science "arxiv.org". 2025.

Added: July 29, 2025

ComputAgeBench: Epigenetic Aging Clocks Benchmark

Dudkovskaia Anastasiia, / Series 005140 "Biorxiv". 2025.

The success of clinical trials of longevity drugs relies heavily on identifying integrative health and aging biomarkers, such as biological age. Epigenetic aging clocks predict the biological age of an individual using their DNA methylation profiles, commonly retrieved from blood samples. However, there is no standardized methodology to validate and compare epigenetic clock models as ...

Added: July 18, 2025

Substantive Criteria for Referring Statements from Texts to Events and Factors

I. V. Loginova, A. S. Piekalnits, E. A. Sabidaeva et al., Scientific and Technical Information Processing 2025 Vol. 52 No. 6 P. 738–751

The purpose of this paper is to advance and automate language models for extracting statements related to events and factors from text documents using the designed linguistic marker system. The paper presents the results of testing text-mining models for event and factor extraction, using as an example analytical research in human potential, social sciences and ...

Added: July 18, 2025

An archaic reference-free method to jointly infer Neanderthal and Denisovan introgressed segments in modern human genomes

Planche L., Ilina A., Ávila-Arcos M. et al., / Series 005140 "Biorxiv". 2025.

Admixture between populations is a common feature of human history. Admixture events introduce new genetic variation that can fuel evolution. Characterizing the significance of admixture events on the evolution of a population across various species is of great interest to evolutionary geneticists. Local Ancestry Inference (LAI) methods infer genetic ancestry of an individual at a ...

Added: May 19, 2025

Proceedings of 8th International Scientific Conference-School for Young Scientists. Physical and Mathematical Modeling of Earth and Environment Processes—2022. (PMMEEP 2022)

Springer, 2023.

The book presents short papers of participants of the 8th International Scientific Conference-School for Young Scientists "Physical and Mathematical Modeling of Earth and Environment Processes" (Ishlinsky Institute for Problems in Mechanics of the Russian Academy of Sciences). The book includes theoretical and experimental studies of processes in the atmosphere, oceans, the lithosphere and their interaction; ...

Added: February 11, 2025

Syntactic concept lattice models for infinitary action logic

Stepan L. Kuznetsov, in: Logic, Language, Information, and Computation: 30th International Workshop, WoLLIC 2024, Bern, Switzerland, June 10–13, 2024, Proceedings. Vol. 14672: Lecture Notes in Computer Science. Cham: Springer, 2024. P. 93–107.

Added: June 12, 2024

A Note on the Number of (Maximal) Antichains in the Lattice of Set Partitions

Ignatov D. I., in: LNAI 14133: 28th International Conference on Conceptual Structures, ICCS 2023, Berlin, Germany, September 11–13, 2023, Proceedings. Graph-Based Representation and Reasoning. Berlin: Springer, 2023. P. 56–69.

Set partitions and partition lattices are well-known objects in combinatorics and play an important role as a search space in many applied problems, including ensemble clustering. Searching for antichains in such lattices is similar to searching for them in Boolean lattices. Counting the number of antichains in Boolean lattices is known as the Dedekind problem. In ...

Added: November 23, 2023

Literary heritage of the 19th–20th centuries: classification of raster images for intelligent analysis and topic modeling of a corpus of handwritten texts

Penskaja E., Khachaturyan L., Филологические науки. Научные доклады высшей школы 2023 No. 5 P. 160–165

The article examines the current trends in working with digital forms of handwritten heritage on the history of Russian literature from the second half of the 19th to the mid-20th century. The process of forming virtual archives is analyzed as a gradual accumulation of the “big data” of scientific research: an unrecognized information array of raster ...

Added: October 30, 2023

Object-Attribute Biclustering for Elimination of Missing Genotypes in Ischemic Stroke Genome-Wide Data

Ignatov D. I., Khvorykh G., Khrunin A. et al., in: Recent Trends in Analysis of Images, Social Networks and Texts. 9th International Conference, AIST 2020, Skolkovo, Moscow, Russia, October 15–16, 2020, Revised Supplementary Proceedings. Vol. 12602. Springer, 2021. P. 185–204.

Missing genotypes can affect the efficacy of machine learning approaches to identify the risk genetic variants of common diseases and traits. The problem occurs when genotypic data are collected from different experiments with different DNA microarrays, each being characterised by its pattern of uncalled (missing) genotypes. This can prevent the ...

Added: November 1, 2022