Improving model parsimony and accuracy by modified greedy feature selection in digital soil mapping

Bibliographic Details
Published in Geoderma Vol. 432; p. 116383
Main Authors Zhang, Xianglin; Chen, Songchao; Xue, Jie; Wang, Nan; Xiao, Yi; Chen, Qianqian; Hong, Yongsheng; Zhou, Yin; Teng, Hongfen; Hu, Bifeng; Zhuo, Zhiqing; Ji, Wenjun; Huang, Yuanfang; Gou, Yuxuan; Richer-de-Forges, Anne C.; Arrouays, Dominique; Shi, Zhou
Format Journal Article
Language English
Published Elsevier B.V. 01.04.2023
Elsevier
Summary:
• We applied modified greedy feature selection (MGFS) to DSM regression.
• MGFS selected the most parsimonious model, with 9 covariates.
• The MGFS-based model had the best accuracy and the lowest uncertainty.
• MGFS had the best computational efficiency across variable selection and map prediction combined.

In the context of increasing soil degradation worldwide, spatially explicit soil information is urgently needed to support decision-making for sustaining limited soil resources. Digital soil mapping (DSM) has proven to be an efficient way to deliver soil information from local to global scales. The number of environmental covariates used for DSM has increased rapidly with the growing volume of remote sensing data, so variable selection is necessary to deal with multicollinearity and improve model parsimony. We proposed the use of modified greedy feature selection (MGFS) for DSM regression and compared it with Boruta, recursive feature elimination (RFE), and variance inflation factor (VIF) analysis. To this end, we used quantile regression forest with 402 soil samples and 392 environmental covariates to map the spatial distribution of soil organic carbon density (SOCD) in Northeast and North China. The results showed that MGFS selected the most parsimonious model with only 9 covariates (e.g., brightness index, mean annual temperature), far fewer than RFE (22 covariates), VIF (30 covariates), and Boruta (76 covariates). Repeated validation (50 times) showed that the MGFS-derived model performed better (R² of 0.60, LCCC of 0.74, RMSE of 13.80 t ha−1) than those using the full set of covariates, Boruta, RFE, and VIF (R² of 0.48–0.57, LCCC of 0.64–0.72, RMSE of 14.24–15.79 t ha−1). Although the uncertainty estimates performed similarly (as measured by PICP), the models using MGFS and RFE had the lowest global uncertainty (0.86), as indicated by the uncertainty index. In addition, MGFS had the best computational efficiency when the steps of variable selection and map prediction were considered together. Given these advantages over Boruta, RFE, and VIF, MGFS has high potential in fine-resolution soil mapping, especially for broad-scale studies involving heavy computation on millions or billions of pixels.
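
The validation metrics named in the summary (R², LCCC, RMSE, PICP) can be illustrated with a minimal Python sketch. It is not the authors' code: the function names are hypothetical, R² is assumed to be the coefficient of determination (1 − SS_res/SS_tot) rather than a squared correlation, LCCC is Lin's concordance correlation coefficient with population (divide-by-n) moments, and PICP is returned as a fraction of observations falling inside the prediction interval.

```python
import math

# Hypothetical helper names; definitions are standard but assumed,
# not taken from the paper.

def rmse(obs, pred):
    """Root mean square error between observed and predicted values."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))

def r2(obs, pred):
    """Coefficient of determination, assumed as 1 - SS_res / SS_tot."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot

def lccc(obs, pred):
    """Lin's concordance correlation coefficient (population moments)."""
    n = len(obs)
    mean_o = sum(obs) / n
    mean_p = sum(pred) / n
    var_o = sum((o - mean_o) ** 2 for o in obs) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(obs, pred)) / n
    return 2.0 * cov / (var_o + var_p + (mean_o - mean_p) ** 2)

def picp(obs, lower, upper):
    """Fraction of observations inside [lower, upper] prediction limits."""
    inside = sum(1 for o, lo, hi in zip(obs, lower, upper) if lo <= o <= hi)
    return inside / len(obs)
```

With a quantile regression forest, `lower` and `upper` would typically be two predicted quantiles (for example the 5th and 95th percentiles for a 90% interval), and a PICP close to the nominal coverage indicates well-calibrated intervals. The quantile levels here are an illustrative assumption, not a detail given in the summary.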
ISSN: 0016-7061, 1872-6259
DOI: 10.1016/j.geoderma.2023.116383