A comparison of random forest variable selection methods for classification prediction modeling

Bibliographic Details
Published in Expert systems with applications Vol. 134; pp. 93-101
Main Authors Speiser, Jaime Lynn; Miller, Michael E.; Tooze, Janet; Ip, Edward
Format Journal Article
Language English
Published New York Elsevier Ltd 15.11.2019
Elsevier BV
Summary:
•We compare performance for random forest variable selection methods.
•VSURF or Jiang's method are preferable for most datasets.
•varSelRF or Boruta perform well for data with >50 predictors.
•Methods with conditional random forest usually have similar performance.
•Type of methods, test- or performance-based, is not likely to impact performance.

Random forest classification is a popular machine learning method for developing prediction models in many research settings. Often in prediction modeling, a goal is to reduce the number of variables needed to obtain a prediction in order to reduce the burden of data collection and improve efficiency. Several variable selection methods exist for the setting of random forest classification; however, there is a paucity of literature to guide users as to which method may be preferable for different types of datasets. Using 311 classification datasets freely available online, we evaluate the prediction error rates, number of variables, computation times and area under the receiver operating curve for many random forest variable selection methods. We compare random forest variable selection methods for different types of datasets (datasets with binary outcomes, datasets with many predictors, and datasets with imbalanced outcomes) and for different types of methods (standard random forest versus conditional random forest methods and test based versus performance based methods). Based on our study, the best variable selection methods for most datasets are Jiang's method and the method implemented in the VSURF R package. For datasets with many predictors, the methods implemented in the R packages varSelRF and Boruta are preferable due to computational efficiency. A significant contribution of this study is the ability to assess different variable selection techniques in the setting of random forest classification in order to identify preferable methods based on applications in expert and intelligent systems.
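
Two of the evaluation measures named in the summary, prediction error rate and area under the receiver operating curve, can be illustrated with a minimal sketch for a binary outcome. This is not the authors' code: the function names, the 0.5 classification threshold, and the toy labels and scores below are assumptions made up for illustration, and the AUC is computed with the standard pairwise (Mann-Whitney) formulation.

```python
def error_rate(y_true, y_pred):
    """Fraction of observations whose predicted class differs from the true class."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc(y_true, scores):
    """Area under the ROC curve for binary labels (1 = positive, 0 = negative).

    Equals the probability that a randomly chosen positive case receives a
    higher score than a randomly chosen negative case, with ties counted as 0.5.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): true classes and predicted probabilities of class 1,
# standing in for the output of a model fit on a selected subset of variables.
y_true = [0, 0, 1, 1, 0, 1]
scores = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print(f"error rate: {error_rate(y_true, y_pred):.3f}")
print(f"AUC:        {auc(y_true, scores):.3f}")
```

Computing the same two quantities for the model built from each method's selected variables, alongside the number of variables kept and the run time, gives the kind of side-by-side comparison the summary describes.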
Author contributions
JLS: Conceptualization, analysis, funding acquisition, methodology, writing original draft
MEM: Funding acquisition, methodology, editing draft
JT: Funding acquisition, methodology, editing draft
EI: Funding acquisition, methodology, editing draft
ISSN: 0957-4174, 1873-6793
DOI: 10.1016/j.eswa.2019.05.028