A Distance-Based Weighted Undersampling Scheme for Support Vector Machines and its Application to Imbalanced Classification

Bibliographic Details
Published in: IEEE Transactions on Neural Networks and Learning Systems, Vol. 29, No. 9, pp. 4152-4165
Main Authors: Kang, Qi; Shi, Lei; Zhou, MengChu; Wang, XueSong; Wu, QiDi; Wei, Zhi
Format: Journal Article
Language: English
Published: United States: IEEE, 01.09.2018
Publisher: The Institute of Electrical and Electronics Engineers, Inc. (IEEE)

More Information
Summary: A support vector machine (SVM) plays a prominent role in classic machine learning, especially classification and regression. Through its structural risk minimization, it has earned a good reputation for effectively reducing overfitting, avoiding the curse of dimensionality, and not becoming trapped in local minima. Nevertheless, existing SVMs do not perform well when facing class imbalance and large-scale samples. Undersampling is a plausible way to address imbalanced problems, but it suffers from high computational cost and reduced accuracy because of its many iterations and its random sampling process. To improve classification performance on imbalanced data, this work proposes a weighted undersampling (WU) scheme for SVM based on space geometry distance, yielding an improved algorithm named WU-SVM. In WU-SVM, majority samples are grouped into subregions (SRs) and assigned different weights according to their Euclidean distance to the hyperplane. Samples in an SR with a higher weight are more likely to be sampled and used in each learning iteration, so as to retain as much of the original data distribution as possible. Comprehensive experiments test WU-SVM on 21 binary-class and six multiclass publicly available data sets. The results show that it outperforms state-of-the-art methods on three popular metrics for imbalanced classification: area under the curve (AUC), F-measure, and G-mean.
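
To make the undersampling idea in the summary concrete, the following minimal Python sketch shows one way a distance-based weighted undersampling step could be realized. The preliminary linear SVM, the quantile-based subregions, and the inverse-rank weights are illustrative assumptions, not the authors' WU-SVM implementation.

    import numpy as np
    from sklearn.svm import LinearSVC

    def distance_weighted_undersample(X, y, majority_label, n_regions=5, seed=0):
        """Undersample the majority class with distance-based subregion weights."""
        rng = np.random.default_rng(seed)
        maj = np.flatnonzero(y == majority_label)
        mino = np.flatnonzero(y != majority_label)

        # Fit a preliminary linear SVM to obtain an approximate separating hyperplane.
        svm = LinearSVC(dual=False).fit(X, y)
        dist = np.abs(svm.decision_function(X[maj]))  # distance to the hyperplane, up to scale

        # Group majority samples into subregions (SRs) by distance quantiles.
        edges = np.quantile(dist, np.linspace(0.0, 1.0, n_regions + 1))
        region = np.clip(np.searchsorted(edges, dist, side="right") - 1, 0, n_regions - 1)

        # Assumed weighting: SRs nearer the hyperplane receive larger weights.
        region_weight = 1.0 / (1.0 + np.arange(n_regions))
        p = region_weight[region]
        p /= p.sum()

        # Sample as many majority points as there are minority points, with
        # probability proportional to each point's subregion weight.
        keep = rng.choice(maj, size=mino.size, replace=False, p=p)
        idx = np.concatenate([keep, mino])
        return X[idx], y[idx]

The balanced subset returned by such a routine would then be used to train the SVM in a subsequent learning iteration.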
ISSN: 2162-237X
2162-2388
DOI: 10.1109/TNNLS.2017.2755595