A System for Knowledge Discovery in Big Dynamical Text Collections

S. Kuznetsov; A. Neznanov; J. Poelmans


Ch. 12. P. 81–87.

The software system Cordiet-FCA is presented, which is designed for knowledge discovery in big dynamic data collections, including natural-language texts. Cordiet-FCA allows one to compose ontology-controlled queries and outputs concept lattices, implication bases, association rules, and other useful concept-based artifacts. Efficient algorithms for data preprocessing, text processing, and visualization of results are discussed. Examples of applying the system to problems of medical diagnostics and criminal investigations are considered.

Language: English

Keywords: natural language processing, data mining, FCA (Formal Concept Analysis), Software Tool, Visualization

In book

Proceedings, Workshop “What can FCA do for Artificial Intelligence?” of the ECAI 2012 conference

Kuznetsov S., Napoli A., Rudolph S. M.: CEUR Workshop Proceedings, 2012.

Recovery degree constrained equiconcept/pseudo-equiconcept reduction in symmetric formal contexts

Junyu B., Fei H., Huilin F. et al., International Journal of Approximate Reasoning 2025 Vol. 187 Article 109541

In Formal Concept Analysis (FCA), concept reduction serves as an important means of simplification. The application scenarios of concept reduction cover various aspects such as data mining, knowledge discovery, strategic decision-making, and rule learning. For symmetric formal contexts, a specialized class of concept reduction exists that can fully recover all knowledge. However, most existing concept ...

Added: December 1, 2025

Rewriting the Rules: LLMs Vs. Traditional ML in University Admissions

Chepikov I., Karpov I., in: Artificial Intelligence in Education. Posters and Late Breaking Results, Workshops and Tutorials, Industry and Innovation Tracks, Practitioners, Doctoral Consortium, Blue Sky, and WideAIED.: Springer, 2025. P. 352–358.

Modern LLMs such as BERT, ChatGPT, and DeepSeek have shown great potential in solving various tasks, including text classification, text generation, and analysis and summarization of documents. In this paper, we show that these models are close to classical ML approaches based on decision trees not only in text processing, but also in processing classical tabular data ...

Added: September 4, 2025

Clustering with Stable Pattern Concepts

Dudyrev E., Mariia Zueva, Kuznetsov S. et al., in: FCA4AI 2024: The 12th International Workshop "What can FCA do for Artificial Intelligence?", October 19 2024, Santiago de Compostela, Spain. Vol. 3911.: CEUR Workshop Proceedings, 2024. P. 47–58.

Clustering aims at finding disjoint groups of similar objects in data and is a major task in Machine Learning. It has also been gaining attention in the Formal Concept Analysis community in recent years. This paper proposes an original approach to the clustering of complex data based on Formal Concept Analysis (FCA) and Pattern Structures. ...

Added: April 30, 2025

FCA4AI 2024: The 12th International Workshop "What can FCA do for Artificial Intelligence?", October 19 2024, Santiago de Compostela, Spain

CEUR Workshop Proceedings, 2024.

The eleven preceding editions of the FCA4AI Workshop showed that many researchers working in Artificial Intelligence are deeply interested in a well-founded method for classification and data mining such as Formal Concept Analysis (see https://upriss.github.io/fca/fca.html). The FCA4AI Workshop Series started with ECAI 2012 (Montpellier) and the last edition was co-located with IJCAI 2023 (Macao, China). The ...

Added: April 29, 2025

Audio-Visual Speech Recognition In-The-Wild: Multi-Angle Vehicle Cabin Corpus and Attention-Based Method

Axyonov Alexandr, Ryumin Dmitry, Ivanko D. et al., in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2024).: IEEE, 2024. P. 8195–8199.

Audio-visual speech recognition (AVSR) gains increasing attention as an important part of human-machine interaction. However, the publicly available corpora are limited, particularly in driving conditions with prevalent background noise. Research so far has been collected in constrained environments, and thus cannot reflect the true performance of AVSR systems in real-world scenarios. Moreover, data for languages ...

Added: March 6, 2025

Automatic Morpheme Segmentation for Russian: Can an Algorithm Replace Experts?

Morozov D., Garipov T., Lyashevskaya O. et al., Journal of Language and Education 2024 Vol. 10 No. 4 P. 71–84

Introduction: Numerous algorithms have been proposed for the task of automatic morpheme segmentation of Russian words. Due to the differences in task formulation and datasets utilized, comparing the quality of these algorithms is challenging. It is unclear whether the errors in the models are due to the ineffectiveness of algorithms themselves or to errors and inconsistencies ...

Added: January 7, 2025

Machine Learning and Knowledge Discovery in Databases. Applied Data Science Track. European Conference, ECML PKDD 2024, Vilnius, Lithuania, September 9–13, 2024, Proceedings, Part X. LNCS, volume 14950

Cham: Springer, 2024.

This multi-volume set, LNAI 14941 to LNAI 14950, constitutes the refereed proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, ECML PKDD 2024, held in Vilnius, Lithuania, in September 2024. ...

Added: November 22, 2024

Cross-country analysis of science, technology and innovation policies: non-covid-19 related and Covid-19 specific STI policies in OECD countries

Russo M., Pavone P., Meissner D. et al., Quality and Quantity 2024 P. 1–25

In OECD countries, Science, Technology and Innovation (STI) policies were seen as key aspects of coping with the Covid-19 pandemic. Now that the pandemic is over, identifying which policy mix portfolios characterised countries in terms of their non-Covid-19 related and Covid-19 specific STI policies fills a knowledge gap on changes in STI policies induced by ...

Added: September 27, 2024

Analyzing the Robustness of Vision & Language Models

Shirnin A., Andreev N., Potapova S. et al., IEEE/ACM Transactions on Speech and Language Processing 2024 Vol. 32 P. 2751–2763

We present an approach to evaluate the robustness of pre-trained vision and language (V&L) models to noise in input data. Given a source image/text, we perturb it using standard computer vision (CV) / natural language processing (NLP) techniques and feed it to a V&L model. To track performance changes, we explore the problem of visual ...

Added: July 19, 2024

Parameter-Efficient Tuning of Transformer Models for Anglicism Detection and Substitution in Russian

Daniil Lukichev, Kryanina Darya, Anastasia Bystrova et al., in: Компьютерная лингвистика и интеллектуальные технологии: По материалам ежегодной международной конференции «Диалог». Issue 22.: [s.n.], 2023. P. 295–306.

Added: April 25, 2024

2023 IEEE International Conference on Data Mining Workshops (ICDMW) 1–4 December 2023, Shanghai, China

Shanghai: IEEE Computer Society, 2023.

The IEEE International Conference on Data Mining (ICDM) has established itself as the world’s premier research conference in data mining. It provides an international forum for presentation of original research results, as well as exchange and dissemination of innovative and practical development experiences. The conference covers all aspects of data mining, including algorithms, software, systems, ...

Added: March 20, 2024

Finding Patterns and Feature Importance in Victimization Survey Data

D'yakonov A., Головина А. М., Прикладная математика и информатика 2023 Vol. 61 No. 74 P. 91–108

A methodology for finding patterns by solving supervised machine learning problems is described and applied to the analysis of national victimization survey data. Important features for machine learning models, interesting patterns, and inconsistencies in the data are found. Experiments on estimating feature importance using different methods are described. ...

Added: March 18, 2024

Explainable Document Classification via Pattern Structures

Sergei O. Kuznetsov, Parakal E. G., Lecture Notes in Networks and Systems 2023 Vol. 776 P. 423–434

Inherently explainable Machine Learning (ML) models are able to provide explanations for their predictions by virtue of their construction. The explanations of a ML model are more comprehensible if they are expressed in terms of its input features. Our paper proposes an inherently explainable pipeline for document classification using pattern structures and Abstract Meaning Representation ...

Added: February 5, 2024

Business Process Management Workshops. BPM 2023 International Workshops, Utrecht, The Netherlands, September 11–15, 2023, Revised Selected Papers

Switzerland: Springer, 2024.

This book constitutes revised papers from the International Workshops held at the 21st International Conference on Business Process Management, BPM 2023, in Utrecht, The Netherlands, during September 2023. Papers from the following workshops are included: • 7th International Workshop on Artificial Intelligence for Business Process Management (AI4BPM 2023) • 7th International Workshop on Business Processes Meet Internet-of-Things (BP-Meet-IoT ...

Added: January 17, 2024

The Chekhov Digital Project: Tasks and Problems of Implementing Semantic Markup of Texts (on the Example of A. P. Chekhov's Story "Death of an Official")

Северина Е. М., Ларионова М. Ч., Litera 2023 No. 10 P. 211–222

The article considers a model of preparation of machine-readable (semantic) markup of texts for the Chekhov Digital project on the example of philological interpretation of individual significant elements of A. P. Chekhov's story "Death of an Official" and presentation of this information explicitly based on the standards of digital publication Text Encoding Initiative (TEI/XML). Based ...

Added: January 12, 2024

Developing a System for Generating Everyday Dialogues in Russian: A Pilot Study

Кругликова В. Г., in: Анализ речи: теоретические и прикладные аспекты: сборник научных статей.: [s.n.], 2023.

The article presents a comparative analysis of various language models used to generate texts and evaluates their effectiveness for the task of generating conversational speech. Models such as GPT-3, BERT, and LSTM are included in the comparative analysis. This study is part of a project to develop a system for generating dialogues in Russian. The ...

Added: December 10, 2023

Investor sentiment and the NFT hype index: to buy or not to buy?

Baklanova V., Kurkin A., Teplova T., China Finance Review International 2024 Vol. 14 No. 3 P. 522–548

Purpose – The primary objective of this research is to provide a precise interpretation of the constructed machine learning model and produce definitive summaries that can evaluate the influence of investor sentiment on the overall sales of non-fungible token (NFT) assets. To achieve this objective, the NFT hype index was constructed as well as several approaches of ...

Added: December 10, 2023

Think about what you’ve learned: Sentiment Analysis for Modeling User Experience in Online Education

Kirina M., Человек: образ и сущность. Гуманитарные аспекты 2024 No. 2(58) P. 176–204

The article focuses on the application of opinion mining techniques to evaluate user experience on the Hyperskill educational platform, using Python, Java, and Kotlin programming projects as the basis of analysis. The study utilizes sentiment analysis and keyword extraction methods to gauge users' attitudes towards the platform, learning process, and topics covered. To achieve this, ...

Added: December 9, 2023

Combining Methods for Term Extraction from Scientific and Technical Text

Bolshakova E. I., Семак В. В., Интеллектуальные системы. Теория и приложения 2021 Vol. 25 No. 4 P. 239–242

An approach to automatic extraction of terms from an individual scientific text is reported, which combines known methods: linguistic patterns, statistical terminological measures, and graph ranking methods. The combined methods and the stages for extraction, selection, and ranking of terms are described, which are implemented for processing documents in Russian. The results of experiments on extracting ...

Added: November 23, 2023

Multimodal Discourse Trees in Forensic Linguistics

Galitsky B., Ilvovsky D., Goncharova E., in: Компьютерная лингвистика и интеллектуальные технологии: По материалам ежегодной международной конференции «Диалог». Issue 22.: [s.n.], 2023.

We extend the concept of a discourse tree (DT) in the discourse representation of text towards data of various forms and natures. The communicative DT, which includes speech act theory, the extended DT, which ascends to the level of multiple documents, and the entity DT, which tracks how discourse covers various entities, were defined previously in computational linguistics; we now proceed ...

Added: November 10, 2023

Sentiment Analysis as a Method for Studying the Information Agenda and Public Opinion (the Case of Media and Social Networks in the PRC)

Анташева М. С., Lobanova P., Isaeva J. K. et al., Социология: методология, методы, математическое моделирование 2023 No. 57 P. 7–41

The information agenda broadcast by Chinese media resources is a source of up-to-date data on public opinion on key issues of social welfare. Due to the technical peculiarities of the organization of Chinese websites and the need to attract additional resources for automatic processing (parsing) of texts in Chinese, this topic is not widely represented in domestic and foreign studies. The ...

Added: November 9, 2023

Classification of Short Scientific Texts

I. K. Kusakin, Fedorets O. V., A. Y. Romanov, Scientific and Technical Information Processing 2023 Vol. 50 No. 3 P. 176–183

This paper discusses modern approaches to natural language processing and the application of machine learning models to the task of classifying short scientific texts in Russian. This study is devoted to the analysis of methods for vectorization of textual information, selection of a model for scientific paper classification, and training of linguistic model BERT ...

Added: November 4, 2023