Variable Importance Scores

Bibliographic Details
Published in	Journal of Data Science, Vol. 19, no. 4, pp. 569-592
Main Authors	Loh, Wei-Yin; Zhou, Peigen
Format	Journal Article
Language	English
Published	中華資料採礦協會 01.10.2021
Subjects	prediction; classification and regression tree; bias correction; missing values

More Information
Summary:	There are many methods of scoring the importance of variables in prediction of a response but not much is known about their accuracy. This paper partially fills the gap by introducing a new method based on the GUIDE algorithm and comparing it with 11 existing methods. For data without missing values, eight methods are shown to give biased scores that are too high or too low, depending on the type of variables (ordinal, binary or nominal) and whether or not they are dependent on other variables, even when all of them are independent of the response. Among the remaining four methods, only GUIDE continues to give unbiased scores if there are missing data values. It does this with a self-calibrating bias-correction step that is applicable to data with and without missing values. GUIDE also provides threshold scores for differentiating important from unimportant variables with 95 and 99 percent confidence. Correlations of the scores to the predictive power of the methods are studied in three real data sets. For many methods, correlations with marginal predictive power are much higher than with conditional predictive power.
ISSN:	1680-743X, 1683-8602
DOI:	10.6339/21-JDS1023