Multiple Imputation Through XGBoost
Published in | Journal of Computational and Graphical Statistics, Vol. 33, no. 2, pp. 352-363 |
---|---|
Main Authors | , |
Format | Journal Article |
Language | English |
Published | Alexandria: Taylor & Francis (Taylor & Francis Ltd), 02.04.2024 |
ISSN | 1061-8600 1537-2715 |
DOI | 10.1080/10618600.2023.2252501 |
Summary: | The use of multiple imputation (MI) is becoming increasingly popular for addressing missing data. Although some conventional MI approaches have been well studied and have shown empirical validity, they have limitations when processing large datasets with complex data structures. Their imputation performances usually rely on the proper specification of imputation models, and this requires expert knowledge of the inherent relations among variables. Moreover, these standard approaches tend to be computationally inefficient for medium and large datasets. In this article, we propose a scalable MI framework mixgb, which is based on XGBoost, subsampling, and predictive mean matching. Our approach leverages the power of XGBoost, a fast implementation of gradient boosted trees, to automatically capture interactions and nonlinear relations while achieving high computational efficiency. In addition, we incorporate subsampling and predictive mean matching to reduce bias and to better account for appropriate imputation variability. The proposed framework is implemented in an R package mixgb. Supplementary materials for this article are available online. |
---|---|
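
The summary names three ingredients: a predictive model, subsampling, and predictive mean matching (PMM). The sketch below illustrates how these can fit together in general; it is not the mixgb implementation and may differ from the PMM variant used in the article. To stay self-contained it assumes a single predictor and swaps in a least-squares line where the article uses XGBoost. All function names, parameter names, and default values (`fit_line`, `pmm_draw`, `multiply_impute`, `m`, `subsample`, `k`) are hypothetical.

```python
import random


def fit_line(x, y):
    """Least-squares line. ASSUMPTION: a simple stand-in for the
    gradient boosted tree model (XGBoost) described in the summary."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx if sxx else 0.0
    return lambda v: mean_y + slope * (v - mean_x)


def pmm_draw(pred_obs, y_obs, pred_mis, k, rng):
    """Generic predictive mean matching: for each missing case, pick the k
    observed cases with the closest predictions (the donors) and return the
    observed value of one donor chosen at random."""
    draws = []
    for p in pred_mis:
        order = sorted(range(len(y_obs)), key=lambda i: abs(pred_obs[i] - p))
        donors = order[:k]
        draws.append(y_obs[rng.choice(donors)])
    return draws


def multiply_impute(x, y, m=5, subsample=0.7, k=5, seed=0):
    """Return m completed copies of y, where None marks a missing entry."""
    rng = random.Random(seed)
    obs = [i for i, v in enumerate(y) if v is not None]
    mis = [i for i, v in enumerate(y) if v is None]
    y_obs = [y[i] for i in obs]
    completed = []
    for _ in range(m):
        # Fit each model on a different random subsample of the observed
        # rows, so the m imputed datasets vary from one another.
        n_sub = min(len(obs), max(2, int(subsample * len(obs))))
        sub = rng.sample(obs, n_sub)
        model = fit_line([x[i] for i in sub], [y[i] for i in sub])
        pred_obs = [model(x[i]) for i in obs]
        pred_mis = [model(x[i]) for i in mis]
        filled = list(y)
        for i, value in zip(mis, pmm_draw(pred_obs, y_obs, pred_mis, k, rng)):
            filled[i] = value
        completed.append(filled)
    return completed


# Toy data: y depends linearly on x, with three entries set to missing.
x = [float(i) for i in range(20)]
y = [2.0 * v + 1.0 for v in x]
for i in (3, 8, 15):
    y[i] = None

imputations = multiply_impute(x, y, m=5)
```

Because PMM fills each gap with an actually observed value, every imputed entry is a value that occurs in the data. Refitting on a fresh subsample for each of the `m` datasets is one simple way to let the imputations vary between datasets.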