Transforming approach for assessing the performance and applicability of rice arsenic contamination forecasting models based on regression and probability methods
Published in | Journal of hazardous materials Vol. 424; no. Pt B; p. 127375 |
---|---|
Main Authors | , , , |
Format | Journal Article |
Language | English |
Published | Netherlands: Elsevier B.V, 15.02.2022 |
Summary: | Probability models have recently been preferred over regression models in contamination evaluation, but a proper performance comparison between the two model types has been lacking. Linear regression, logistic regression, XGBoost-based regression, and probability models were built using soil arsenic and selected soil physicochemical properties of 287 samples to predict arsenic in rice grains. The outputs of all models were uniformly converted to binary classes for comparison. The complex algorithm-based models, XGBoost-based regression (R² = 0.046 ± 0.036) and the probability models (cross-entropy = 0.697 ± 0.020), did not surpass the simple linear regression (R² = 0.046 ± 0.031) and logistic regression (cross-entropy = 0.694 ± 0.021) models. Accuracy, sensitivity, specificity, precision, and F1 score showed that the probability models have no advantage over the regression models, although these indicators are not proper scoring rules for a probability model. When the contaminant concentration in grains was discretized for probabilistic modeling, the split point was set at the limit concentration rather than chosen from the structure of the dataset, which would reduce the inherent advantage of the probability model. For predicting crop contamination, the probability model cannot replace the regression model, and simple but robust algorithm-based models are preferred when the quality and quantity of the dataset are poor.
• Regression and probability models were compared using binary output classification.
• The complex algorithm-based models did not outperform the simple ones.
• The performance of the probability model depends on the discretization process.
• The probability model cannot completely replace the regression model, as expected. |
---|---|
ISSN: | 0304-3894 1873-3336 |
DOI: | 10.1016/j.jhazmat.2021.127375 |
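
The summary compares models by converting every output to a binary class and then reporting accuracy, sensitivity, specificity, precision, F1 score, and cross-entropy. The sketch below illustrates how such indicators can be computed; it is not the authors' code. The function names, the sample concentrations, and the limit of 0.2 are hypothetical, and the use of the natural logarithm in the cross-entropy is an assumption, since the summary does not state the base.

```python
import math


def binarize(values, limit):
    """Label a concentration 1 (exceeds the limit) or 0 (does not)."""
    return [1 if v > limit else 0 for v in values]


def confusion_metrics(y_true, y_pred):
    """Compute the binary-classification indicators named in the summary."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Return NaN instead of raising when a class is absent.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


def cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean binary cross-entropy (natural log, an assumption)."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(y_true)


# Hypothetical grain concentrations and limit, for illustration only.
LIMIT = 0.2
observed = [0.10, 0.25, 0.30, 0.15, 0.22, 0.05]
predicted = [0.12, 0.21, 0.18, 0.25, 0.30, 0.08]

metrics = confusion_metrics(binarize(observed, LIMIT), binarize(predicted, LIMIT))
baseline = cross_entropy([0, 1], [0.5, 0.5])  # a constant 0.5 prediction
```

Under the natural-log assumption, a constant prediction of 0.5 scores ln 2 ≈ 0.693, which gives some context for cross-entropy values of this size. Unlike cross-entropy, the thresholded indicators discard the predicted probabilities, which is why the summary notes that they are not proper scoring rules for a probability model.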