Stratified sampling for feature subspace selection in random forests for high dimensional data

Bibliographic Details
Published in	Pattern Recognition, Vol. 46, no. 3, pp. 769-787
Main Authors	Ye, Yunming; Wu, Qingyao; Zhexue Huang, Joshua; Ng, Michael K.; Li, Xutao
Format	Journal Article
Language	English
Published	Kidlington: Elsevier Ltd, 01.03.2013
Subjects	Algorithms; Applied sciences; Classification; Decision trees; Detection, estimation, filtering, equalization, prediction; Ensemble classifier; Exact sciences and technology; Forests; High-dimensional data; Image processing; Information, signal and communications theory; Neural networks; Pattern recognition; Random forests; Sampling; Signal and communications theory; Signal processing; Signal representation. Spectral analysis; Signal, noise; Stratified sampling; Subspaces; Telecommunications and information theory; Biometrics; Performance evaluation; Nearest neighbour; Automatic classification; State of the art; Random sampling; Face recognition; Subspace method; Support vector machine; Signal classification; Feature extraction; Automatic recognition; Categorization
Summary:	For high dimensional data, a large portion of the features are often not informative of the class of the objects. Random forest algorithms tend to use simple random sampling of features in building their decision trees and consequently select many subspaces that contain few, if any, informative features. In this paper we propose a stratified sampling method to select the feature subspaces for random forests with high dimensional data. The key idea is to stratify features into two groups: one containing strongly informative features and the other weakly informative features. Then, for feature subspace selection, we randomly select features from each group proportionally. The advantage of stratified sampling is that it ensures each subspace contains enough informative features for classification in high dimensional data. Tests on both synthetic data and various real data sets in gene classification, image categorization and face recognition consistently demonstrate the effectiveness of this new method. Its performance is shown to be better than that of state-of-the-art algorithms including SVM, the four variants of random forests (RF, ERT, enrich-RF, and oblique-RF), and nearest neighbor (NN) algorithms.
► Propose a stratified sampling method to select feature subspaces for random forests.
► Introduce a stratification variable to divide features into strong and weak groups.
► Select features from each group to ensure each subspace contains useful features.
► The new method increases the random forest strength and maintains the correlation.
► Extensive experiments demonstrated the effectiveness of the new method.
ISSN:	0031-3203, 1873-5142
DOI:	10.1016/j.patcog.2012.09.005
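
The subspace selection described in the summary can be illustrated with a minimal sketch. This is not the authors' implementation: the record does not say how the stratification variable is computed, so the per-feature score below (between-class over within-class variance), the mean-score threshold, and the function names `informativeness`, `stratify` and `sample_subspace` are all assumptions made for illustration. Only the overall shape (split features into a strong and a weak group, then draw from each group proportionally) follows the summary.

```python
import numpy as np


def informativeness(X, y):
    """Assumed score: between-class variance over within-class variance, per feature."""
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for label in np.unique(y):
        rows = X[y == label]
        class_mean = rows.mean(axis=0)
        between += len(rows) * (class_mean - overall_mean) ** 2
        within += ((rows - class_mean) ** 2).sum(axis=0)
    return between / (within + 1e-12)  # small constant avoids division by zero


def stratify(scores):
    """Split feature indices into strong and weak groups (assumed threshold: mean score)."""
    threshold = scores.mean()
    strong = np.flatnonzero(scores > threshold)
    weak = np.flatnonzero(scores <= threshold)
    return strong, weak


def sample_subspace(strong, weak, size, rng):
    """Draw `size` features, split between the groups in proportion to group size."""
    total = len(strong) + len(weak)
    n_strong = int(round(size * len(strong) / total))
    if len(strong) > 0:
        n_strong = max(1, n_strong)  # keep at least one strong feature
    n_strong = min(n_strong, len(strong), size)
    n_weak = min(size - n_strong, len(weak))
    picked_strong = rng.choice(strong, size=n_strong, replace=False)
    picked_weak = rng.choice(weak, size=n_weak, replace=False)
    return np.concatenate([picked_strong, picked_weak])


# Synthetic demo: 50 features, only the first 5 carry class information.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = rng.normal(size=(200, 50))
X[:, :5] += 3.0 * y[:, None]

strong, weak = stratify(informativeness(X, y))
subspace = sample_subspace(strong, weak, size=10, rng=rng)
print("strong group:", strong)
print("sampled subspace:", np.sort(subspace))
```

In a forest, a fresh subspace of this kind would be drawn where plain random forests draw features uniformly at random. With simple random sampling, a 10-feature subspace from this synthetic data would contain no informative feature in a sizeable fraction of draws, which is the problem the summary describes.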