Assessing temporal data partitioning scenarios for estimating reference evapotranspiration with machine learning techniques in arid regions

Bibliographic Details
Published in	Journal of hydrology (Amsterdam) Vol. 590; p. 125252
Main Authors	Hossein Kazemi, Mohammad; Shiri, Jalal; Marti, Pau; Majnooni-Heris, Abolfazl
Format	Journal Article
Language	English
Published	Elsevier B.V. 01.11.2020
Subjects	Evapotranspiration; Gene expression programming; Hold out; K-fold validation
Summary:	•Temporal data partitioning strategies were adopted for ETo estimation.
•Adopted strategies included both hold-out and k-fold validation procedures.
•Fixing monthly patterns as test data provides more insight into model performance.
•K-fold validation produced promising results.
Recently, data-driven machine learning techniques have been widely applied to model reference evapotranspiration (ETo) under various climatic conditions, with differing numbers of sites and lengths of available data records. A major issue in applying these models is the proper selection of training and testing data sets. Although some spatial generalization approaches have been recommended for this purpose, no local (temporal) data partitioning strategies have been specifically recommended for machine learning based ETo estimation. The present study evaluates different hold-out and k-fold validation temporal data partitioning strategies when using the gene expression programming (GEP) technique to estimate daily ETo in arid regions. The k-fold validation strategies considered annual, monthly and growing season patterns as test data sets. Although the commonly used partitioning of the available patterns into training and testing sets gave accurate results, statistical analysis showed that the results obtained through k-fold validation were more reliable. A two-block partitioning strategy with chronological data selection for training and testing provided the most accurate results among the hold-out procedures (mean scatter index (SI) value of 0.162). Fixing the extreme ETo values as the training data set in hold-out procedures provided the least accurate results, with considerable over- and underestimation of the ETo values (mean SI value of 0.506). Results based on hold-out approaches can be biased or only partially valid, depending on which part of the time series is selected as test data. K-fold validation yielded the smallest over- and underestimation of ETo values. Further, among the k-fold validation strategies, using monthly patterns as the minimum affordable test size produced higher error magnitudes, while using the complete patterns of one growing season provided more accurate results.
ISSN:	0022-1694 1879-2707
DOI:	10.1016/j.jhydrol.2020.125252
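
The partitioning strategies named in the summary can be illustrated with a minimal Python sketch. It is not the authors' GEP implementation and rests on several assumptions: the function names are hypothetical, the 75/25 split in the two-block hold-out is arbitrary, the scatter index is taken as RMSE divided by the mean observed value (the record itself does not define SI), and a "monthly" fold is read here as one calendar month pooled across all years, which may differ from the paper's definition.

```python
import math
from datetime import date, timedelta


def scatter_index(observed, estimated):
    """Assumed definition: RMSE divided by the mean of the observed values."""
    n = len(observed)
    rmse = math.sqrt(sum((e - o) ** 2 for o, e in zip(observed, estimated)) / n)
    return rmse / (sum(observed) / n)


def two_block_holdout(dates, train_fraction=0.75):
    """Chronological hold-out: the earliest block trains, the latest block tests."""
    order = sorted(range(len(dates)), key=lambda i: dates[i])
    cut = int(len(order) * train_fraction)  # train_fraction is an assumption
    return order[:cut], order[cut:]


def k_fold_by_period(dates, key):
    """Yield (label, train_idx, test_idx), holding out one period per fold.

    `key` maps a date to a fold label, e.g. its year or its calendar month.
    """
    labels = sorted({key(d) for d in dates})
    for label in labels:
        test = [i for i, d in enumerate(dates) if key(d) == label]
        train = [i for i, d in enumerate(dates) if key(d) != label]
        yield label, train, test


# Hypothetical daily record covering three full years (2001 to 2003).
start = date(2001, 1, 1)
dates = [start + timedelta(days=i) for i in range(1095)]

train_idx, test_idx = two_block_holdout(dates)
annual_folds = list(k_fold_by_period(dates, key=lambda d: d.year))
monthly_folds = list(k_fold_by_period(dates, key=lambda d: d.month))
```

In this sketch every period serves as test data exactly once under k-fold validation, whereas the hold-out split evaluates the model on a single block of the series, which is the source of the selection bias the summary describes.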