Under-Sampling and Feature Selection Algorithms for S2SMLP

Bibliographic Details
Published in: IEEE Access, Vol. 8, pp. 191803-191814
Main Authors: Liu, Shudong; Zhang, Ke
Format: Journal Article
Language: English
Published: Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2020
Summary: Imbalance learning is a hot topic in the data mining and machine learning domains. Data-level, algorithm-level and ensemble solutions are the three main methods proposed thus far to address imbalance learning. To alleviate the issues of data explosion and feature selection in the multilayer perceptron based on simultaneous two-sample representation (S2SMLP), this paper first exploits spectral clustering to select majority samples and thereby construct a smaller training dataset for the classifier: all majority samples are divided into clusters by spectral clustering; different numbers of representative samples are extracted from each cluster according to its size and the average distance between the minority class and the samples of that cluster; and the extracted majority samples are combined with all minority samples to form the training dataset. Secondly, we propose a novel feature selection method based on a pairwise sample-distance constraint, which considers the class labels of paired samples and selects the features that push two similar samples closer together and pull two dissimilar samples farther apart. Finally, we conduct extensive experiments on 44 two-class imbalanced datasets and four high-dimensional DNA microarray datasets. The experimental results demonstrate that the proposed algorithms outperform several state-of-the-art algorithms in terms of F-measure, G-mean and AUC.
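The cluster-based under-sampling step summarized above can be sketched as follows. This is an illustrative reconstruction, not the authors' exact algorithm: the per-cluster sampling quota here uses an assumed heuristic (weight grows with cluster size and shrinks with the cluster's average distance to the minority class), since the abstract does not give the precise allocation formula.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def undersample_majority(X_maj, X_min, n_keep, n_clusters=5, seed=0):
    """Select ~n_keep majority samples via spectral clustering.

    Clusters the majority class, then draws more samples from clusters
    that are larger and closer (on average) to the minority class.
    """
    rng = np.random.default_rng(seed)
    labels = SpectralClustering(
        n_clusters=n_clusters, affinity="nearest_neighbors",
        n_neighbors=10, random_state=seed,
    ).fit_predict(X_maj)

    min_centroid = X_min.mean(axis=0)
    weights = []
    for k in range(n_clusters):
        cluster = X_maj[labels == k]
        # average distance from this cluster to the minority-class centroid
        avg_dist = np.linalg.norm(cluster - min_centroid, axis=1).mean()
        # assumed heuristic: favor large clusters, penalize distant ones
        weights.append(len(cluster) / (1.0 + avg_dist))
    weights = np.asarray(weights) / np.sum(weights)

    kept = []
    for k in range(n_clusters):
        idx = np.flatnonzero(labels == k)
        quota = min(len(idx), max(1, int(round(weights[k] * n_keep))))
        kept.append(rng.choice(idx, size=quota, replace=False))
    return X_maj[np.concatenate(kept)]
```

The returned majority subset would then be concatenated with all minority samples to form the reduced training set for the classifier.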
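The pairwise-distance-constraint idea can likewise be sketched as a simple per-feature score: a feature is preferred if it keeps same-class pairs close while pushing different-class pairs apart. The ratio criterion below is an illustrative assumption in that spirit, not the paper's exact formulation.

```python
import numpy as np

def pairwise_feature_scores(X, y):
    """Score each feature by between-class vs within-class pairwise gaps."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n, d = X.shape
    same = y[:, None] == y[None, :]
    np.fill_diagonal(same, False)          # drop self-pairs
    diff = ~same
    np.fill_diagonal(diff, False)
    scores = np.empty(d)
    for j in range(d):
        gaps = np.abs(X[:, j, None] - X[None, :, j])
        within = gaps[same].mean()         # mean gap over same-class pairs
        between = gaps[diff].mean()        # mean gap over different-class pairs
        scores[j] = between / (within + 1e-12)  # larger = more discriminative
    return scores

def select_top_features(X, y, k):
    """Return indices of the k highest-scoring features."""
    return np.argsort(pairwise_feature_scores(X, y))[::-1][:k]
```

For high-dimensional data such as DNA microarrays, a score of this kind lets the classifier be trained on only the top-ranked features.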
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2020.3032520