A comparison of random forest variable selection methods for classification prediction modeling

Bibliographic Details
Published in	Expert Systems with Applications, Vol. 134, pp. 93-101
Main Authors	Speiser, Jaime Lynn; Miller, Michael E.; Tooze, Janet; Ip, Edward
Format	Journal Article
Language	English
Published	New York: Elsevier Ltd; Elsevier BV, 15.11.2019
Subjects	Classification; Data acquisition; Datasets; Feature reduction; Identification methods; Machine learning; Methods; Random forest; Test procedures; Variable selection

More Information
Summary:
• We compare the performance of random forest variable selection methods.
• VSURF and Jiang's method are preferable for most datasets.
• varSelRF and Boruta perform well for data with >50 predictors.
• Methods with conditional random forest usually have similar performance.
• The type of method, test-based or performance-based, is not likely to impact performance.
Random forest classification is a popular machine learning method for developing prediction models in many research settings. Often in prediction modeling, a goal is to reduce the number of variables needed to obtain a prediction in order to reduce the burden of data collection and improve efficiency. Several variable selection methods exist for the setting of random forest classification; however, there is a paucity of literature to guide users as to which method may be preferable for different types of datasets. Using 311 classification datasets freely available online, we evaluate the prediction error rates, number of variables, computation times, and area under the receiver operating characteristic curve for many random forest variable selection methods. We compare random forest variable selection methods for different types of datasets (datasets with binary outcomes, datasets with many predictors, and datasets with imbalanced outcomes) and for different types of methods (standard random forest versus conditional random forest methods, and test-based versus performance-based methods). Based on our study, the best variable selection methods for most datasets are Jiang's method and the method implemented in the VSURF R package. For datasets with many predictors, the methods implemented in the R packages varSelRF and Boruta are preferable due to computational efficiency. A significant contribution of this study is the ability to assess different variable selection techniques in the setting of random forest classification in order to identify preferable methods based on applications in expert and intelligent systems.
Bibliography:	Author contributions. JLS: Conceptualization, analysis, funding acquisition, methodology, writing original draft. MEM: Funding acquisition, methodology, editing draft. JT: Funding acquisition, methodology, editing draft. EI: Funding acquisition, methodology, editing draft.
ISSN:	0957-4174, 1873-6793
DOI:	10.1016/j.eswa.2019.05.028