RB-CCR: Radial-Based Combined Cleaning and Resampling algorithm for imbalanced data classification

Bibliographic Details
Published in	Machine Learning, Vol. 110, no. 11-12, pp. 3059-3093
Main Authors	Koziarski, Michał; Bellinger, Colin; Woźniak, Michał
Format	Journal Article
Language	English
Published	New York: Springer US, 01.12.2021; Springer Nature B.V.
Subjects	Algorithms; Artificial Intelligence; Binary data; Classification; Cleaning; Computer Science; Control; Imbalanced data; Machine Learning; Mechatronics; Natural Language Processing (NLP); Oversampling; Radial basis functions; Recall; Resampling; Robotics; Simulation and Modeling; Special Issue: Foundations of Data Science; Training
ISSN	0885-6125 1573-0565
DOI	10.1007/s10994-021-06012-8

Summary:	Real-world classification domains, such as medicine, health and safety, and finance, often exhibit imbalanced class priors and have asynchronous misclassification costs. In such cases, the classification model must achieve a high recall without significantly impacting precision. Resampling the training data is the standard approach to improving classification performance on imbalanced binary data. However, the state-of-the-art methods ignore the local joint distribution of the data or correct it as a post-processing step. This can cause sub-optimal shifts in the training distribution, particularly when the target data distribution is complex. In this paper, we propose Radial-Based Combined Cleaning and Resampling (RB-CCR). RB-CCR utilizes the concept of class potential to refine the energy-based resampling approach of CCR. In particular, RB-CCR exploits the class potential to accurately locate sub-regions of the data space for synthetic oversampling. The category of sub-region used for oversampling can be specified as an input parameter to meet domain-specific needs or be automatically selected via cross-validation. Our 5 × 2 cross-validated results on 57 benchmark binary datasets with 9 classifiers show that RB-CCR achieves a better precision-recall trade-off than CCR and generally outperforms the state-of-the-art resampling methods in terms of AUC and G-mean.
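
The summary does not define class potential or say how sub-regions are chosen, so the following is only an illustrative sketch and not the authors' implementation. It assumes that the potential of a class at a point is a sum of Gaussian radial basis functions centred on that class's samples, and that candidate points are ranked by potential and split into equally sized groups. The names class_potential, split_by_potential, gamma and n_regions are hypothetical.

```python
import numpy as np


def class_potential(points, class_samples, gamma=1.0):
    """Assumed definition: sum of Gaussian RBFs centred on class_samples.

    Returns one potential value per row of `points`.
    """
    points = np.atleast_2d(points)
    class_samples = np.atleast_2d(class_samples)
    # Pairwise squared Euclidean distances, shape (n_points, n_samples).
    diff = points[:, None, :] - class_samples[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    # Each sample contributes exp(-(distance / gamma)^2) to the potential.
    return np.exp(-sq_dist / gamma ** 2).sum(axis=1)


def split_by_potential(candidates, potentials, n_regions=3):
    """Hypothetical helper: rank candidates by potential, lowest first,
    and cut the ranking into n_regions groups of near-equal size."""
    order = np.argsort(potentials)
    return [candidates[idx] for idx in np.array_split(order, n_regions)]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    minority = rng.normal(loc=0.0, scale=0.5, size=(10, 2))
    # Candidate locations for synthetic samples around the minority class.
    candidates = rng.uniform(-2.0, 2.0, size=(30, 2))
    potentials = class_potential(candidates, minority, gamma=0.5)
    low, mid, high = split_by_potential(candidates, potentials)
    print(len(low), len(mid), len(high))
```

Under these assumptions, choosing which group to draw synthetic samples from plays the role of the input parameter mentioned in the summary, and that choice could in principle be tuned by cross-validation.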