Core clustering as a tool for tackling noise in cluster labels

de Amorim R. C., Makarenkov V., Mirkin B.

doi:10.1007/s00357-019-9303-4


Journal of Classification. 2020. Vol. 37. No. 1. P. 143–157.

de Amorim R. C., Makarenkov V., Mirkin B.

Real-world data sets often contain mislabelled entities. This can be particularly problematic if the data set is being used by a supervised classification algorithm at its learning phase. In this case, the accuracy of this classification algorithm, when applied to unlabelled data, is likely to suffer considerably. In this paper, we introduce a clustering-based method capable of reducing the number of mislabelled entities in data sets. Our method can be summarised as follows: (i) cluster the data set; (ii) select the entities that have the most potential to be assigned to correct clusters; (iii) use the entities of the previous step to define the core clusters and map them to the labels using a confusion matrix; (iv) use the core clusters and our cluster membership criterion to correct the labels of the remaining entities. We perform numerous experiments to validate our method empirically using k-nearest neighbour classifiers as a benchmark. We experiment with both synthetic and real-world data sets with different proportions of mislabelled entities. Our experiments demonstrate that the proposed method produces promising results. Thus, it could be used as a preprocessing data correction step of a supervised machine learning algorithm.
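The four steps in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not give the clustering algorithm's settings, the core-selection rule or the cluster membership criterion, so the following are assumptions made only for illustration: plain k-means with a deterministic farthest-point initialisation, a hypothetical `core_fraction` parameter that keeps the entities closest to each centroid as the core, a majority vote over each confusion-matrix row to map clusters to labels, and nearest-core-centroid assignment for the remaining entities.

```python
from collections import Counter

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    return [sum(col) / len(points) for col in zip(*points)]

def nearest(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)), key=lambda c: sq_dist(point, centroids[c]))

def kmeans(data, k, max_iter=100):
    """Plain k-means; farthest-point initialisation is an assumption of this sketch."""
    centroids = [list(data[0])]
    while len(centroids) < k:
        far = max(data, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))
    assign = None
    for _ in range(max_iter):
        new_assign = [nearest(p, centroids) for p in data]
        if new_assign == assign:
            break
        assign = new_assign
        for c in range(k):
            members = [p for p, a in zip(data, assign) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = centroid(members)
    return assign, centroids

def correct_labels(data, labels, k, core_fraction=0.6):
    """Return a corrected copy of labels (hypothetical helper, not from the paper)."""
    assign, centroids = kmeans(data, k)                      # step (i)
    corrected = list(labels)
    is_core = [False] * len(data)
    core_centroids, core_labels = [], []
    for c in range(k):
        members = [i for i, a in enumerate(assign) if a == c]
        if not members:
            continue
        # Step (ii): assumed rule, keep the entities closest to the centroid.
        members.sort(key=lambda i: sq_dist(data[i], centroids[c]))
        core = members[:max(1, round(core_fraction * len(members)))]
        # Step (iii): one confusion-matrix row, mapped to its majority label.
        row = Counter(labels[i] for i in core)
        label = row.most_common(1)[0][0]
        core_centroids.append(centroid([data[i] for i in core]))
        core_labels.append(label)
        for i in core:
            is_core[i] = True
            corrected[i] = label
    # Step (iv): assumed membership rule, nearest core centroid.
    for i, p in enumerate(data):
        if not is_core[i]:
            corrected[i] = core_labels[nearest(p, core_centroids)]
    return corrected

# Two well-separated groups with one mislabelled entity in each.
data = [[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5],
        [10, 10], [10, 11], [11, 10], [11, 11], [10.5, 10.5]]
noisy = ["a", "a", "a", "b", "a", "b", "a", "b", "b", "b"]
corrected = correct_labels(data, noisy, k=2)
```

In this toy example one mislabelled entity falls outside its core cluster and the other falls inside it; both are corrected because the majority label of each core cluster is still the right one. With heavier label noise, or with overlapping clusters, this simplified mapping can fail.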

Research target: Computer Science

Priority areas: IT and mathematics

Keywords: k-means, Minkowski metric, clustering, noise data

Publication based on the results of:

Decision making and data analysis in socio-economic and political systems (2020)

Minkowski metric, feature weighting and anomalous cluster initializing in K-Means clustering

Mirkin B., Amorim R., Pattern Recognition 2012 Vol. 45 No. 3 P. 1061–1075

This paper represents another step in overcoming a drawback of K-Means, its lack of defense against noisy features, by using feature weights in the criterion. The Weighted K-Means method by Huang et al. is extended to the corresponding Minkowski metric for measuring distances. Under the Minkowski metric the feature weights become intuitively appealing feature rescaling factors ...

Added: November 26, 2012
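The rescaling interpretation mentioned in the abstract above can be checked numerically. The sketch below assumes the weighted distance has the form sum over features v of (w_v * |x_v - c_v|)^p; the paper's full criterion (for instance, whether weights are cluster-specific) is not given in the truncated abstract, so the function names and this form are illustrative assumptions. Under that form, weighting the distance is the same as multiplying each feature by its weight and then using the unweighted Minkowski distance.

```python
def weighted_minkowski(x, c, w, p):
    """Assumed form: sum_v (w_v * |x_v - c_v|) ** p, i.e. the p-th power of the distance."""
    return sum((wv * abs(xv - cv)) ** p for xv, cv, wv in zip(x, c, w))

def rescale(x, w):
    """Multiply every feature by its weight."""
    return [wv * xv for xv, wv in zip(x, w)]

x, c = [1.0, 4.0], [0.0, 0.0]
w, p = [0.5, 0.25], 3

d_weighted = weighted_minkowski(x, c, w, p)
# Same distance on rescaled data with all weights equal to one.
d_rescaled = weighted_minkowski(rescale(x, w), rescale(c, w), [1.0, 1.0], p)
```

Both values agree, which is the sense in which the weights act as feature rescaling factors.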

The Minkowski central partition as a pointer to a suitable distance exponent and consensus partitioning

Mirkin B., Amorim R., Makarenkov V. et al., Pattern Recognition 2017 Vol. 67 P. 62–72

The Minkowski weighted K-means (MWK-means) is a recently developed clustering algorithm capable of computing feature weights. The cluster-specific weights in MWK-means follow the intuitive idea that a feature with low variance should have a greater weight than a feature with high variance. The final clustering found by this algorithm depends on the selection of the ...

Added: March 30, 2017
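The idea in the abstract above, that a low-variance feature should receive a greater weight, can be sketched as follows. The update rule used here, w_v = 1 / sum_u (D_v / D_u)^(1/(p-1)) with D_v the within-cluster dispersion of feature v, is an assumption for illustration and is not quoted from the abstract; it requires p > 1 and strictly positive dispersions. The function names are hypothetical.

```python
def dispersions(points, centre, p):
    """D_v = sum over cluster points of |y_v - centre_v| ** p, for each feature v."""
    return [sum(abs(y[v] - centre[v]) ** p for y in points)
            for v in range(len(centre))]

def feature_weights(disp, p):
    """Assumed rule: w_v = 1 / sum_u (D_v / D_u) ** (1 / (p - 1)); needs p > 1, D_u > 0."""
    exponent = 1.0 / (p - 1.0)
    return [1.0 / sum((dv / du) ** exponent for du in disp) for dv in disp]

# One cluster: feature 0 varies little, feature 1 varies a lot.
cluster = [[0.0, 0.0], [2.0, 8.0]]
centre = [1.0, 4.0]          # the mean, which is the Minkowski centre for p = 2
disp = dispersions(cluster, centre, p=2)
weights = feature_weights(disp, p=2)
```

With this rule the weights of a cluster sum to one, and the feature with the smaller dispersion gets the larger weight.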

Distributed clustering of website user behaviour data for recommender systems

Новиков О. В., Образование. Наука. Научные кадры 2013 No. 2-2013 P. 164–167

This article presents a new technique for collaborative filtering based on pre-clustering of website usage data. The key idea involves using clustering methods to define groups of different users. ...

Added: April 6, 2013

Big Data Clustering in Cardeology Based on Modeling of Electrical Dynamics of the Heart in the form of Fermi-Pasta-Ulam Auto-Recurrence as a New Tool for the Study of Cardiac Activity

Shmid A., Новопашин М. А., Березин А. А. et al., Clinical Cardiology and Cardiovascular Interventions 2018 No. 1-10004 P. 1–8

The mass application of mobile cardiographs already leads to both explosive quantitative growth of the number of patients available for ECG study, registered daily outside the hospital (Big DATA in cardiology), and to the emergence of new qualitative opportunities for the study of long-term oscillatory processes (weeks, months, years) of the dynamics of the individual ...

Added: November 15, 2018

FPU recurrence electromagnetic spectrum as a possible physiotherapeutic tool

Shmid A., Zimina E., Новопашин М. А. et al., Physiotherapy Research and Reports, Open Access Text, UK 2018 No. 1(2) P. 1–4

The idea of forced external synchronization of the heart dynamics by the canonical FPU spectrum with a purpose to lower the rate of its desynchronization in some pathological cases has been hypothesized by the authors. It was concluded that a heart being a multi resonant distributed dynamic ion containing system may be resonantly influenced by ...

Added: August 2, 2019

Dynamics of cluster structures in a financial market network

Anton Kocheturov, Mikhail Batsyn, Panos M. Pardalos, Physica A: Statistical Mechanics and its Applications 2014 Vol. 413 P. 523–533

Over the past fifteen years, network analysis has become a powerful tool for studying financial markets. In this work we analyze the stock markets of the USA and Sweden. We study cluster structures of a market network constructed from a correlation matrix of returns of the stocks traded in each of these markets. ...

Added: July 24, 2014

Cluster analysis of cardiological data

Зимина Е. Ю., Статистика и Экономика 2018 Vol. 15 No. 2 P. 30–37

The article surveys cluster analysis of medical data, using cardiac data as an example. Clustering methods are among the most effective and commonly used Data Mining methods applied to large amounts of information (for example, in mathematical economics): the search for signs of similarity between objects in the study of the subject area ...

Added: May 29, 2018

Clustering of agents in a bounded neighbourhood model

Akopov A. S., Beklaryan A., Искусственные общества 2020 Vol. 15 No. 3 P. 1–11

This article presents a new approach to designing agent-based bounded neighbourhood models (Schelling's models). An original agent-based model has been developed in the AnyLogic system, which describes the segregation processes caused by the behaviour patterns of agent-individuals. Various scenarios (environment characteristics) affecting the cluster structure of the spatial distribution of agents are examined. Using the proposed bounded ...

Added: September 14, 2020

A-Wardpβ: Effective hierarchical clustering using the Minkowski metric and a fast k-means initialisation

de Amorim R. C., Makarenkov V., Mirkin B., Information Sciences 2016 Vol. 370-371 No. November P. 343–354

In this paper we make two novel contributions to hierarchical clustering. First, we introduce an anomalous pattern initialisation method for hierarchical clustering algorithms, called A-Ward, capable of substantially reducing the time they take to converge. This method generates an initial partition with a sufficiently large number of clusters. This allows the cluster merging process to ...

Added: September 7, 2016

Organizing Multimedia Data in Video Surveillance Systems Based on Face Verification with Convolutional Neural Networks

Sokolova Anastasiia, Kharchevnikova Angelina, Savchenko A., Lecture Notes in Computer Science 2018 Vol. 10716 P. 223–230

In this paper we propose a two-stage approach to organizing information in video surveillance systems. First, faces are detected in each frame and the video stream is split into sequences of frames containing the face region of one person. Second, these sequences (tracks) that contain identical faces are grouped using face verification algorithms and ...

Added: October 24, 2017

Advances in Computational Intelligence. IWANN 2019

Berlin: Springer, 2019.

This two-volume set LNCS 10305 and LNCS 10306 constitutes the refereed proceedings of the 15th International Work-Conference on Artificial Neural Networks, IWANN 2019, held at Gran Canaria, Spain, in June 2019. The 150 revised full papers presented in this two-volume set were carefully reviewed and selected from 210 submissions. The papers are organized in topical sections ...

Added: July 29, 2019

Efficient facial representations for age, gender and identity recognition in organizing photo albums using multi-output ConvNet

Savchenko A., PeerJ Computer Science 2019 Vol. 5:e197 P. 1–26

This paper focuses on the automatic extraction of persons and their attributes (gender, year of birth) from albums of photos and videos. A two-stage approach is proposed in which, first, a convolutional neural network simultaneously predicts age and gender from all photos and additionally extracts facial representations suitable for face identification. Here the MobileNet is modified ...

Added: June 12, 2019

23rd International Symposium on Methodologies for Intelligent Systems - Proceedings

Birkhauser/Springer, 2017.

This book constitutes the proceedings of the 23rd International Symposium on Foundations of Intelligent Systems, ISMIS 2017, held in Warsaw, Poland, in June 2017. The 56 regular and 15 short papers presented in this volume were carefully reviewed and selected from 118 submissions. The papers include both theoretical and practical aspects of machine learning, data mining ...

Added: September 18, 2017

Cloud technologies in the problems of mathematical analysis of cardiological information

Zimina E., Shmid A., Новопашин М. А., Data Science. Information Technology and Nanotechnology 2018, CEUR workshop proceedings 2018 No. 2212 P. 112–118

The article surveys the use of cloud services and technologies. It reviews the mathematical analysis of cardiac information using cloud technology, which provides storage, analysis and forecasting based on the data held. In addition, the authors consider the possibility of integrating cloud technologies with external systems. The massive use of mobile devices for ...

Added: August 27, 2019

Formal Concept Analysis: 16th International Conference, ICFCA 2021, Strasbourg, France, June 29 – July 2, 2021, Proceedings

Springer, 2021.

This book constitutes the proceedings of the 16th International Conference on Formal Concept Analysis, ICFCA 2021, held in Strasbourg, France, in June/July 2021. The 14 full papers and 5 short papers presented in this volume were carefully reviewed and selected from 32 submissions. The book also contains four invited contributions in full paper length. The research part ...

Added: July 10, 2021

CEE-SECR '19 Proceedings of the 15th Central and Eastern European Software Engineering Conference in Russia

Silakov D., NY: ACM, 2019.

Added: November 20, 2019

Analysis and interpretation of imaging mass spectrometry data by clustering mass-to-charge images according to their spatial similarity

Alexandrov T., Chernyavsky I., Becker M. et al., Analytical Chemistry 2013 Vol. 85 No. 23 P. 11189–11195

Imaging mass spectrometry (imaging MS) has emerged in the past decade as a label-free, spatially resolved, and multipurpose bioanalytical technique for direct analysis of biological samples from animal tissue, plant tissue, biofilms, and polymer films. Imaging MS has been successfully incorporated into many biomedical pipelines where it is usually applied in the so-called untargeted mode-capturing spatial localization of a multitude of ions ...

Added: November 18, 2013

Capturing the right number of clusters with K-Means using the complementary criterion and affinity propagation

Токмаков М. А., Mirkin B., Journal of Classification 2017

K-Means is one of the most popular clustering methods. Yet it has some shortcomings, such as the need to choose in advance the number of clusters K and the starting locations of their centers. This paper pursues an approach of taking advantage of a reformulation of the square-error criterion based on a Pythagorean decomposition of the ...

Added: April 15, 2017

Braverman Readings in Machine Learning. Key Ideas from Inception to Current State

Heidelberg: Springer Publishing Company, 2018.

This state-of-the-art survey is dedicated to the memory of Emmanuil Markovich Braverman (1931-1977), a pioneer in developing the machine learning theory. The 12 revised full papers and 4 short papers included in this volume were presented at the conference "Braverman Readings in Machine Learning: Key Ideas from Inception to Current State" held in Boston, MA, USA, in ...

Added: September 11, 2018

Core concepts in data analysis: summarization, correlation, visualization (Undergraduate topics in Computer Science)

Mirkin B., L.: Springer, 2011.

This is a textbook in data analysis. Its contents are heavily influenced by the idea that data analysis should help in enhancing and augmenting knowledge of the domain as represented by the concepts and statements of relation between them. According to this view, two main pathways for data analysis are summarization, for developing and augmenting ...

Added: January 11, 2014

Topic models in the task of single-word term extraction

М.А. Нокель, Н.В. Лукашевич, Программная инженерия 2014 No. 3 P. 34–40

The paper describes the results of an experimental study of statistical topic models applied to the task of automatic single-word term extraction. The English part of the Europarl parallel corpus from the socio-political domain and the Russian articles taken from online banking magazines were used as target text collections. The experiments demonstrate that topic information ...

Added: October 1, 2014

Particle Simulation for Predicting Effective Properties of Short Fiber Reinforced Composites

Skoptsov K. A., Sheshenin S., Galatenko V. V. et al., International Journal of Applied Mechanics 2016 Vol. 8 No. 2 P. 1650016-01–1650016-18

We present a method for evaluating elastic properties of a composite material produced by molding a resin filled with short elastic fibers. A flow of the filled resin is simulated numerically using a mesh-free method. After that, assuming that spatial distribution and orientation of fibers are not significantly changed during polymerization, effective elastic moduli of ...

Added: May 22, 2016

Computational Linguistics and Intellectual Technologies: Proceedings of the Annual International Conference "Dialogue" (Moscow, May 29 – June 1, 2019). Issue 18 (25)

Moscow: Russian State University for the Humanities Publishing Center, 2019.

The collection includes 27 papers from the international conference on computational linguistics and intellectual technologies "Dialogue 2019" that were not included in the yearbook "Computational Linguistics and Intellectual Technologies" but were recommended by the Programme Committee for presentation at the conference. It is intended for specialists in theoretical and applied linguistics and intellectual technologies. ...

Added: December 10, 2019

Algorithms and methods for solving scheduling problems and other extremum problems on large-scale graphs

Chernyshev S. V., Cherepanov E. A., Pankratiev E. V. et al., Journal of Mathematical Sciences 2005 Vol. 128 No. 6 P. 3487–3495

Added: January 27, 2014