Enhancing bankruptcy prediction efficiency using synthetic data

Business Informatics. 2025. Vol. 19. No. 3. P. 22–47.

Predicting firm financial insolvency is crucial for investors, creditors, and regulators. However, access to high-quality, balanced data for model training is often limited due to privacy concerns, information scarcity, or financial reporting characteristics. This paper explores the potential of synthetic data generation techniques to increase the number of minority class instances in imbalanced datasets and thereby potentially improve insolvency prediction models. The paper compares the performance of various imbalance reduction methods, including established methods such as the Synthetic Minority Oversampling Technique (SMOTE), with new synthetic data generation approaches based on Bayesian networks, marginal distributions, random forests, and generative adversarial networks. These methods are evaluated by their ability to improve classification performance, measured by the Gini coefficient, the geometric mean, and the false positive and false negative rates. The sample for the experiment consists of real financial performance data of industrial SME companies in Finland for 2021. The results contribute to the growing body of knowledge on synthetic data generation and its application to imbalanced datasets and predictive modelling in the financial industry. They also provide insights into how effectively different synthetic data generation methods rebalance imbalanced datasets and improve the accuracy and reliability of firm insolvency prediction models.
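As a rough illustration of the ideas named in the abstract, the sketch below shows SMOTE-style interpolation between minority-class points and the four evaluation metrics. It is not the paper's implementation: the function names, the toy data, and the choice of `k` are assumptions made for this example, and the Gini coefficient is computed with the common convention Gini = 2 * AUC - 1.

```python
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Hypothetical helper: build n_new synthetic points, each interpolated
    between a minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        # Sort the remaining minority points by squared Euclidean distance.
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
        neighbour = rng.choice(others[:k])
        t = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([a + t * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

def confusion_rates(y_true, y_pred):
    """Return (false positive rate, false negative rate, geometric mean).
    Assumes both classes occur in y_true; class 1 is the insolvent class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fpr = fp / (fp + tn)
    fnr = fn / (fn + tp)
    # Geometric mean of specificity (1 - FPR) and sensitivity (1 - FNR).
    g_mean = ((1 - fpr) * (1 - fnr)) ** 0.5
    return fpr, fnr, g_mean

def gini(y_true, scores):
    """Gini = 2 * AUC - 1, with AUC from pairwise comparisons (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return 2 * wins / (len(pos) * len(neg)) - 1

# Toy example: three insolvent firms described by two made-up ratios.
minority = [[0.1, 1.2], [0.2, 1.0], [0.15, 1.4]]
new_points = smote_like(minority, n_new=5)
```

Because each synthetic point lies on a segment between two existing minority points, this kind of oversampling cannot leave the region spanned by the minority class, which is one motivation for comparing it with model-based generators such as those listed in the abstract.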

Language: English

Automatic detection of the emotional state of participants in topical conversations from speech transcripts

Dvoynikova A., Mamontov D. Yu., Karpov A. A., in: Almanac of Scientific Works of Young Scientists of ITMO University. Vol. 3.: ITMO University, 2021. P. 63–68.

The paper presents experimental studies on determining the level of emotional expression in the text transcripts of the K-EmoCon database. It examines how class balancing during classifier training affects the accuracy of emotion detection. The article establishes a baseline for classifying the emotion level of speakers in text transcripts. ...

Added: April 24, 2026

Assessing the Big Data Value: Approaches and Methods

Maltseva S. V., in: Informatics and Applied Mathematics: Proceedings of the X International Scientific and Practical Conference (08.10 - 11.10.2025). Vol. 1: Collection of materials, part 1.: Almaty: Institute of Information and Computational Technologies of the Science Committee of the Ministry of Science and Higher Education of the Republic of Kazakhstan, 2025.

Modern technological capabilities for obtaining data make data an important resource. Data analytics, the development of products and services that actively use big data, and the implementation of the data-driven organization concept necessitate further development of methods for assessing the value, usefulness and cost of big data. Existing and promising methods, including the influence of ...

Added: March 3, 2026

A foundation model for time series and how (not) to train it on synthetic data

Temirkhanov A., Kostromina A. M., Tsymboi O. A. et al., Doklady Rossiiskoi Akademii Nauk. Matematika, Informatika, Protsessy Upravleniya (formerly Doklady Akademii Nauk. Matematika) 2025 Vol. 527 No. S P. 485–494

Industry is rich in cases where forecasts are required for a large number of time series at once. However, we might be in a situation where we cannot afford to train a separate model for each of them. This issue in time series modeling has not received due attention. The remedy for ...

Added: February 24, 2026

AGDES: a Python package and an approach to generating synthetic data for differential equation solving with LLMs

Vladimir Zakharov, Anton Surkov, Sergei Koltcov, Procedia Computer Science 2025 Vol. 258 P. 1169–1178

The rapid development of large language models (LLMs), including their successful application to solving mathematical problems requiring complex reasoning, presents a potential avenue for using LLMs in solving differential equations. While these equations are currently being solved successfully both numerically and via the symbolic approach, it is possible that fine-tuned LLMs, if they treat solving ...

Added: August 21, 2025

Sim4Rec: Flexible and Extensible Simulator for Recommender Systems for Large-Scale Data

Anna Volodkevich, Ivanova V., Vasilev A. et al., in: Advances in Information Retrieval: 47th European Conference on Information Retrieval, ECIR 2025, Lucca, Italy, April 6–10, 2025, Proceedings, Part IV.: Springer, 2025. P. 425–430.

Simulators for recommender systems are widely used to evaluate recommender system performance and to analyze feedback loop effects. Existing simulators often propose inflexible pipelines, are focused on narrow research tasks, or are not adapted to work with large industrial data volumes. To address these challenges, we developed the Sim4Rec simulation framework. The Sim4Rec models key aspects ...

Added: April 10, 2025

User response modeling in recommender systems: a survey

M. Shirokikh, Shenbin I., Alekseev A. et al., Journal of Mathematical Sciences 2024 Vol. 285 No. 2 P. 255–284

Over the last several decades, recommender systems have become an integral part of both our daily lives and the research frontier in machine learning. In this survey, we explore various approaches to developing simulators for recommendation systems, especially for modeling the user response function. We consider simple probabilistic models, approaches based on generative adversarial networks, ...

Added: November 24, 2024

MedSyn: LLM-based synthetic medical text generation framework

Kumichev G., Blinov P., Kuzkina Y. et al., in: Machine Learning and Knowledge Discovery in Databases. Applied Data Science Track. European Conference, ECML PKDD 2024, Vilnius, Lithuania, September 9–13, 2024, Proceedings, Part X. LNCS, volume 14950.: Cham: Springer, 2024. P. 215–230.

Generating synthetic text addresses the challenge of data availability in privacy-sensitive domains such as healthcare. This study explores the applicability of synthetic data in real-world medical settings. We introduce MedSyn, a novel medical text generation framework that integrates large language models with a Medical Knowledge Graph (MKG). We use MKG to sample prior medical information for the prompt and generate synthetic ...

Added: November 22, 2024

The Role of Synthetic Data in Improving Neural Network Algorithms

Rabchevskiy A., Leonid N. Yasnitsky, in: 2022 4th International Conference on Control Systems, Mathematical Modeling, Automation and Energy Efficiency (SUMMA).: IEEE, 2022. P. 316–312.

This review article describes synthetic data, its applications, and examples of improving neural network algorithms with synthetic data. Using these examples, we show the important role of synthetic data in the improvement of neural network algorithms and the development of artificial intelligence ...

Added: February 15, 2024

Creating and Using Synthetic Data for Neural Network Training, Using the Creation of a Neural Network Classifier of Online Social Network User Roles as an Example

Rabchevskiy A., Yasnitsky L., in: Digital Science: DSIC 2021. Vol. 381.: Switzerland: Birkhauser/Springer, 2022. P. 412–421.

Added: February 14, 2024

A study of the application of machine learning methods to detecting fraud against bank customers at transaction confirmation

Shelepova A. N., Vorobyev I., in: E. V. Armensky Interuniversity Scientific and Technical Conference of Students, Postgraduates and Young Specialists 2023.: MIEM HSE University, 2023. P. 289–292.

Today, detecting fraud in the banking sector is considerably hampered by fraudsters' use of social engineering methods. Fraudsters deceive customers and persuade them to transfer funds to the fraudsters' accounts under various pretexts. To counter this threat, banks block transactions and contact the customer for additional confirmation. Under the psychological influence of the fraudsters, customers confirm the transactions despite the warnings ...

Added: February 13, 2024

Synthesis of Datasets for Neural Networks Based on Expert Knowledge

Rabchevskiy A., Ashikhmin E., Yasnitsky L., in: Cyber-Physical Systems and Control II.: Springer, 2023. P. 535–544.

The problem of creating datasets for training and testing neural networks is described using the example of a social network management task. A method of expert dataset synthesis based on experts’ knowledge of the subject area is proposed. The essence of the method lies in the fact that sets are generated randomly within the ...

Added: November 20, 2023

Bankruptcy probability forecasting models and their applicability to construction companies

Voyko A. V., Uchet. Analiz. Audit 2021 Vol. 8 No. 1 P. 13–23

The paper examines some foreign and domestic methods of forecasting bankruptcy of enterprises in order to apply them in the largest construction organizations in Russia. The empirical basis of the study is the construction companies that are comparable in size, revenue, and market share. Their annual financial statements preceding the analysis are the information base ...

Added: November 2, 2021

Quality of risk management in a bank: prerequisites for financial problems

Khasyanova S. Y., Tsyganova V. V., Russian Management Journal 2018 Vol. 16 No. 2 P. 187–204

The recent significant decrease in the number of banks in the Russian Federation and the resulting high social costs of liquidation and rehabilitation procedures underpin the need for continuous improvement of early-warning systems for bankruptcy. The aim of the article is to identify the key leading indicators of financial insolvency of banks. The study was ...

Added: October 9, 2018