Real-value negative selection over-sampling for imbalanced data set learning



Bibliographic Details
Published in Expert systems with applications Vol. 129; pp. 118 - 134
Main Authors Tao, Xinmin, Li, Qing, Ren, Chao, Guo, Wenjie, Li, Chenxi, He, Qing, Liu, Rui, Zou, Junrong
Format Journal Article
Language English
Published New York Elsevier Ltd 01.09.2019
Elsevier BV
Summary:
• As an over-sampling method, RNSO does not require any minority class instances to be available.
• The generation of artificial minority class instances relies only on the majority class.
• RNSO can effectively avoid generating noisy instances and duplicated instances.
• RNS can solve imbalanced classification tasks without any modification of the classifier.
• The RNSO-based approach obtains better imbalanced classification results than other approaches.

Learning from imbalanced data sets poses a major challenge for the data mining community. Conventional machine learning algorithms perform poorly on imbalanced classification problems because they were originally designed to work with balanced class distributions. In this paper, we propose a new over-sampling technique that uses the real-value negative selection (RNS) procedure to generate artificial minority data without requiring any actual minority data to be available. The generated minority data, together with the rare actual minority data if available, are combined with the majority data as input to a bi-class classification approach for learning. In the experiments, we demonstrate the effectiveness of RNS in avoiding problems often encountered by existing over-sampling methods, such as the generation of noisy instances and of near-duplicate instances in the same clusters. Moreover, extensive experimental results on imbalanced datasets from the UCI repository and on real-world imbalanced datasets show that the proposed hybrid approach achieves better performance, in terms of both the G-Mean and F-Measure evaluation metrics, than other existing imbalanced dataset classification techniques.
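The idea described in the summary, generating artificial minority instances from the majority class alone, can be illustrated with a rough sketch. This record does not give the details of RNSO itself, so the code below is not the authors' algorithm. It assumes the classic constant-radius form of real-valued negative selection: random candidates are drawn inside the bounding box of the majority ("self") data, and a candidate is kept only if it lies farther than a fixed radius from every majority sample. The function name `rns_oversample` and the parameters `self_radius` and `max_tries` are hypothetical.

```python
import numpy as np


def rns_oversample(majority, n_new, self_radius, rng=None, max_tries=100_000):
    """Illustrative sketch: draw up to n_new artificial minority points.

    Only majority-class data is used. This is an assumed, simplified
    constant-radius negative selection, not the RNSO procedure itself.
    """
    rng = np.random.default_rng(rng)
    majority = np.asarray(majority, dtype=float)
    # Candidates are sampled uniformly inside the majority bounding box.
    low, high = majority.min(axis=0), majority.max(axis=0)

    accepted = []
    for _ in range(max_tries):
        if len(accepted) == n_new:
            break
        candidate = rng.uniform(low, high)
        # Negative selection: discard candidates that "match" self,
        # i.e. fall within self_radius of any majority sample.
        dists = np.linalg.norm(majority - candidate, axis=1)
        if dists.min() > self_radius:
            accepted.append(candidate)

    # May return fewer than n_new points if max_tries is exhausted.
    return np.array(accepted).reshape(-1, majority.shape[1])


# Usage on toy data (hypothetical values, for illustration only).
data_rng = np.random.default_rng(0)
majority = data_rng.normal(size=(200, 2))
synthetic = rns_oversample(majority, n_new=50, self_radius=0.5, rng=1)

# Combine majority (label 0) with generated minority (label 1)
# as input to any ordinary two-class classifier.
X = np.vstack([majority, synthetic])
y = np.concatenate([np.zeros(len(majority)), np.ones(len(synthetic))])
```

Because every accepted point keeps a minimum distance from all majority samples, this sketch cannot place synthetic points on top of majority data. Avoiding near-duplicates among the generated points themselves would need an additional check that is not shown here.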
ISSN:0957-4174
1873-6793
DOI:10.1016/j.eswa.2019.04.011