Analysis of Preprocessing Techniques for Missing Data in the Prediction of Sunflower Yield in Response to the Effects of Climate Change

Bibliographic Details
Published in	Applied sciences Vol. 13; no. 13; p. 7415
Main Authors	Călin, Alina Delia, Coroiu, Adriana Mihaela, Mureşan, Horea Bogdan
Format	Journal Article
Language	English
Published	Basel MDPI AG 01.07.2023
Subjects	Abstract machines; Agricultural production; Agricultural research; Algorithms; Anomalies; Artificial intelligence; Climate change; Climate effects; Crop yield; Crop yields; Crops; Data collection; Datasets; Global temperature changes; Histograms; imputation; Interpolation; Machine learning; Methods; Missing data; outlier detection; Outliers (statistics); Phenology; Precipitation; Precipitation (Meteorology); prediction; Predictions; Preprocessing; Rain; regression; Regression analysis; Sunflowers; Support vector machines; Weather; United States > US
Summary:	Machine learning is often used to predict crop yield from the sowing date and weather parameters in non-irrigated crops. In the context of climate change, regression algorithms can help identify correlations and plan agricultural activities to maximise production. For sunflower crops, the datasets we identified are small and have many missing values, which results in a low-performance regression model. In this paper, we study and compare several approaches to missing-value imputation in order to improve our regression model. In our experiments, we compare nine imputation methods: mean values, similar values, interpolation (linear, spline, pad), and prediction (linear regression, random forest, extreme gradient boosting regressor, and histogram gradient boosting regression). We also apply four unsupervised outlier removal algorithms and examine their influence on the regression model: isolation forest, minimum covariance determinant, local outlier factor, and OneClass-SVM. After preprocessing, the resulting datasets are used to build regression models with the extreme gradient boosting regressor and histogram gradient boosting regression, and their performance is compared. The evaluation shows that R² increases from 0.723 when instances with missing data are removed to 0.938 with imputation using random forest prediction and OneClass-SVM-based outlier removal.
ISSN:	2076-3417
DOI:	10.3390/app13137415
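
As a rough illustration of three of the simpler imputation strategies named in the summary (mean values, linear interpolation, and pad), together with the R² metric used for evaluation, here is a minimal standard-library Python sketch. It is not the authors' code: the function names (`impute_mean`, `impute_pad`, `impute_linear`, `r2_score`) and the toy rainfall series are assumptions made purely for illustration, and the handling of gaps at the edges of the series is one possible choice among several.

```python
from statistics import mean


def impute_mean(values):
    """Replace each missing entry (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def impute_pad(values):
    """Forward-fill: carry the last observed value into each gap.

    Gaps before the first observation stay None (an assumption of this sketch).
    """
    out, last = [], None
    for v in values:
        if v is not None:
            last = v
        out.append(last)
    return out


def impute_linear(values):
    """Fill gaps by linear interpolation between the nearest observed neighbours.

    Gaps at either end take the nearest observed value (an assumption of this sketch).
    """
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("at least one observed value is required")
    out = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            out[i] = values[right]
        elif right is None:
            out[i] = values[left]
        else:
            span = values[right] - values[left]
            out[i] = values[left] + span * (i - left) / (right - left)
    return out


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    centre = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - centre) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


if __name__ == "__main__":
    # Hypothetical rainfall readings with gaps; not data from the paper.
    rainfall = [12.0, None, 18.0, None, None, 30.0]
    print("mean:  ", impute_mean(rainfall))
    print("pad:   ", impute_pad(rainfall))
    print("linear:", impute_linear(rainfall))
```

The prediction-based imputers and the outlier removal algorithms listed in the summary would normally come from a machine learning library rather than being written by hand; they are left out here to keep the sketch dependency-free.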