Transforming approach for assessing the performance and applicability of rice arsenic contamination forecasting models based on regression and probability methods

Bibliographic Details
Published in	Journal of hazardous materials Vol. 424; no. Pt B; p. 127375
Main Authors	Zhao, Chen; Yang, Jun; Shi, Huading; Chen, Tongbin
Format	Journal Article
Language	English
Published	Netherlands: Elsevier B.V., 15.02.2022
Subjects	Arsenic; Arsenic - analysis; Oryza; Probability; Probability forecasting; Regression; Rice grain; Soil; Soil Pollutants - analysis
Summary:	Probability models have recently been preferred over regression models in contamination evaluation, but a proper performance comparison between the two model types is lacking. Linear regression, logistic regression, XGBoost-based regression, and probability models were built from soil arsenic and certain soil physicochemical properties of 287 samples to predict arsenic in rice grains. The outputs of all models were uniformly converted to binary classes for comparison. The complex algorithm-based models, namely the XGBoost-based regression (R² = 0.046 ± 0.036) and probability models (cross-entropy = 0.697 ± 0.020), did not surpass the simple linear regression (R² = 0.046 ± 0.031) and logistic regression models (cross-entropy = 0.694 ± 0.021). Accuracy, sensitivity, specificity, precision, and F1 score showed that the probability models had no advantage over the regression models, although these indicators are not proper scoring rules for the probability model. When the contaminant concentration in grains was discretized for probabilistic modeling, the splitting point was set at the limit concentration rather than chosen from the structure of the dataset, which would reduce the inherent advantage of the probability model. When predicting crop contamination, the probability model cannot replace the regression model, and simple but robust algorithm-based models are preferred when the quality and quantity of the dataset are poor. • Regression and probability models were compared using binary output classification. • The complex algorithm-based models did not surpass the simple ones. • The performance of the probability model depends on the discretization process. • The probability model cannot replace the regression model completely, as expected.
ISSN:	0304-3894 1873-3336
DOI:	10.1016/j.jhazmat.2021.127375
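
The comparison described in the Summary (binarizing every model's output, then scoring with accuracy, sensitivity, specificity, precision, F1, and cross-entropy) can be sketched roughly as follows. This is only an illustration and not the authors' code: the function names, the limit value `LIMIT`, the 0.5 probability cutoff, and all sample numbers are assumptions made up for the example.

```python
import math

def binarize(values, threshold):
    """Label each value 1 (exceeds the threshold) or 0 (does not)."""
    return [1 if v > threshold else 0 for v in values]

def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity, precision and F1 from binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def ratio(num, den):
        # Return NaN instead of raising when a class is absent.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }

def cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean binary cross-entropy (natural log) of predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(y_true)

# Hypothetical limit concentration and made-up sample values (not from the paper).
LIMIT = 0.2
measured = [0.10, 0.25, 0.30, 0.15, 0.22, 0.18]        # grain arsenic, illustrative
regression_pred = [0.12, 0.21, 0.19, 0.17, 0.26, 0.23]  # regression model output
probability_pred = [0.30, 0.60, 0.40, 0.45, 0.70, 0.55] # P(exceeds LIMIT)

y_true = binarize(measured, LIMIT)
reg_labels = binarize(regression_pred, LIMIT)   # regression output cut at the limit
prob_labels = binarize(probability_pred, 0.5)   # probability output cut at 0.5 (assumed)

print("regression :", classification_metrics(y_true, reg_labels))
print("probability:", classification_metrics(y_true, prob_labels))
print("cross-entropy:", cross_entropy(y_true, probability_pred))
```

For reference, a model that always outputs a probability of 0.5 has a cross-entropy of ln 2 ≈ 0.693, which is close to the cross-entropy values reported in the Summary.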