Under-Sampling and Feature Selection Algorithms for S2SMLP

Bibliographic Details
Published in: IEEE Access, Vol. 8, pp. 191803-191814
Main Authors: Liu, Shudong; Zhang, Ke
Format: Journal Article
Language: English
Published: Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2020
Summary: Imbalance learning is a hot topic in the data mining and machine learning domains. Data-level, algorithm-level and ensemble solutions are the three main methods proposed thus far to address imbalance learning. To alleviate the issues of data explosion and feature selection in the multilayer perceptron based on simultaneous two-sample representation (S2SMLP), this paper first exploits spectral clustering to select majority samples and thereby construct a smaller training dataset for the classifier: all majority samples are divided into clusters by spectral clustering; different numbers of representative samples are extracted from each cluster according to its size and the average distance between the minority class and the samples of that cluster; and the extracted majority samples are combined with all minority samples to form the training dataset. Secondly, we propose a novel feature selection method based on a pairwise sample-distance constraint, which considers the class labels of paired samples and selects the features that push two similar samples closer together and pull two dissimilar samples farther apart. Finally, we conduct extensive experiments on 44 two-class imbalanced datasets and four high-dimensional DNA microarray datasets. The experimental results demonstrate that the proposed algorithms outperform several state-of-the-art algorithms in terms of F-measure, G-mean and AUC.
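The cluster-based under-sampling step summarized above can be sketched as follows. This is an illustrative reconstruction, not the authors' exact algorithm: the per-cluster sampling quota here uses an assumed heuristic (weight grows with cluster size and shrinks with the cluster's average distance to the minority class), since the abstract does not give the precise allocation formula.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def undersample_majority(X_maj, X_min, n_keep, n_clusters=5, seed=0):
    """Select ~n_keep majority samples via spectral clustering.

    Clusters the majority class, then draws more samples from clusters
    that are larger and closer (on average) to the minority class.
    """
    rng = np.random.default_rng(seed)
    labels = SpectralClustering(
        n_clusters=n_clusters, affinity="nearest_neighbors",
        n_neighbors=10, random_state=seed,
    ).fit_predict(X_maj)

    min_centroid = X_min.mean(axis=0)
    weights = []
    for k in range(n_clusters):
        cluster = X_maj[labels == k]
        # average distance from this cluster to the minority-class centroid
        avg_dist = np.linalg.norm(cluster - min_centroid, axis=1).mean()
        # assumed heuristic: favor large clusters, penalize distant ones
        weights.append(len(cluster) / (1.0 + avg_dist))
    weights = np.asarray(weights) / np.sum(weights)

    kept = []
    for k in range(n_clusters):
        idx = np.flatnonzero(labels == k)
        quota = min(len(idx), max(1, int(round(weights[k] * n_keep))))
        kept.append(rng.choice(idx, size=quota, replace=False))
    return X_maj[np.concatenate(kept)]
```

The returned majority subset would then be concatenated with all minority samples to form the reduced training set for the classifier.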
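The pairwise-distance-constraint idea can likewise be sketched as a simple per-feature score: a feature is preferred if it keeps same-class pairs close while pushing different-class pairs apart. The ratio criterion below is an illustrative assumption in that spirit, not the paper's exact formulation.

```python
import numpy as np

def pairwise_feature_scores(X, y):
    """Score each feature by between-class vs within-class pairwise gaps."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n, d = X.shape
    same = y[:, None] == y[None, :]
    np.fill_diagonal(same, False)          # drop self-pairs
    diff = ~same
    np.fill_diagonal(diff, False)
    scores = np.empty(d)
    for j in range(d):
        gaps = np.abs(X[:, j, None] - X[None, :, j])
        within = gaps[same].mean()         # mean gap over same-class pairs
        between = gaps[diff].mean()        # mean gap over different-class pairs
        scores[j] = between / (within + 1e-12)  # larger = more discriminative
    return scores

def select_top_features(X, y, k):
    """Return indices of the k highest-scoring features."""
    return np.argsort(pairwise_feature_scores(X, y))[::-1][:k]
```

For high-dimensional data such as DNA microarrays, a score of this kind lets the classifier be trained on only the top-ranked features.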
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2020.3032520