kNNDM CV: k-fold nearest-neighbour distance matching cross-validation for map accuracy estimation

Bibliographic Details
Published in	Geoscientific Model Development Vol. 17; no. 15; pp. 5897-5912
Main Authors	Linnenbrink, Jan; Milà, Carles; Ludwig, Marvin; Meyer, Hanna
Format	Journal Article
Language	English
Published	Katlenburg-Lindau Copernicus GmbH 07.08.2024 Copernicus Publications
Subjects	Accuracy; Algorithms; Computation; Design; Distance; Distribution functions; Environmental science; Estimates; Geographical distribution; Machine learning; Matching; Methods; Performance prediction; Prediction models; Sampling; Simulation; Training
Summary:	Random and spatial cross-validation (CV) methods are commonly used to evaluate machine-learning-based spatial prediction models, and the performance values obtained are often interpreted as map accuracy estimates. However, the appropriateness of such approaches is currently the subject of controversy. For the common case where no probability sample for validation purposes is available, in Milà et al. (2022) we proposed the nearest-neighbour distance matching (NNDM) leave-one-out (LOO) CV method. This method produces a distribution of geographical nearest-neighbour distances (NNDs) between test and training locations during CV that matches the distribution of NNDs between prediction and training locations. Hence, it creates predictive conditions during CV that are comparable to what is required when predicting a defined area. Although NNDM LOO CV produced largely reliable map accuracy estimates in our analysis, as a LOO-based method, it cannot be applied to the large datasets found in many studies. Here, we propose a novel k-fold CV strategy for map accuracy estimation inspired by the concepts of NNDM LOO CV: the k-fold NNDM (kNNDM) CV. The kNNDM algorithm tries to find a k-fold configuration such that the empirical cumulative distribution function (ECDF) of NNDs between test and training locations during CV is matched to the ECDF of NNDs between prediction and training locations. We tested kNNDM CV in a simulation study with different sampling distributions and compared it to other CV methods including NNDM LOO CV. We found that kNNDM CV performed similarly to NNDM LOO CV and produced reasonably reliable map accuracy estimates across sampling patterns. However, compared to NNDM LOO CV, kNNDM resulted in significantly reduced computation times. In an experiment using 4000 strongly clustered training points, kNNDM CV reduced the time spent on fold assignment and model training from 4.8 d to 1.2 min. 
Furthermore, we found a positive association between the quality of the match of the two ECDFs in kNNDM and the reliability of the map accuracy estimates. kNNDM provided the advantages of our original NNDM LOO CV strategy while bypassing its sample size limitations.
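The matching criterion described in the summary can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the area between the two ECDFs (the 1-D Wasserstein distance) is assumed here as one possible measure of mismatch, and the search for a good k-fold configuration is not shown. The sketch only scores a given fold assignment by comparing the NNDs between test and training locations during CV with the NNDs between prediction and training locations.

```python
import numpy as np


def nearest_neighbour_distances(targets, references):
    """Distance from each target point to its closest reference point."""
    diffs = targets[:, None, :] - references[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)


def cv_nnd(train_xy, folds):
    """NND from each training point to the closest point held in another fold.

    Assumes at least two non-empty folds, otherwise distances are infinite.
    """
    diffs = train_xy[:, None, :] - train_xy[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    # Mask the point itself and all points in the same fold: during CV they
    # are in the test set and cannot serve as training neighbours.
    dist[folds[:, None] == folds[None, :]] = np.inf
    return dist.min(axis=1)


def ecdf_mismatch(a, b):
    """Area between the ECDFs of a and b (assumed mismatch measure)."""
    values = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), values[:-1], side="right") / a.size
    cdf_b = np.searchsorted(np.sort(b), values[:-1], side="right") / b.size
    return float(np.sum(np.abs(cdf_a - cdf_b) * np.diff(values)))


# Hypothetical data: 200 clustered training points in the unit square and a
# regular 20 x 20 grid of prediction locations.
rng = np.random.default_rng(0)
centres = rng.uniform(0, 1, size=(5, 2))
cluster_id = rng.integers(0, 5, size=200)
train_xy = centres[cluster_id] + rng.normal(0, 0.02, size=(200, 2))
gx, gy = np.meshgrid(np.linspace(0, 1, 20), np.linspace(0, 1, 20))
pred_xy = np.column_stack([gx.ravel(), gy.ravel()])

# Target distribution: NNDs between prediction and training locations.
target = nearest_neighbour_distances(pred_xy, train_xy)

# Score two candidate 5-fold configurations; a smaller value means the CV
# NND distribution is closer to the prediction NND distribution.
random_folds = rng.integers(0, 5, size=200)
for name, folds in [("random folds", random_folds), ("cluster folds", cluster_id)]:
    print(name, ecdf_mismatch(cv_nnd(train_xy, folds), target))
```

A fold-search procedure in this spirit would evaluate several candidate configurations and keep the one with the smallest mismatch; how kNNDM generates and selects candidates is described in the full article, not in this record.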
ISSN:	1991-9603 1991-959X 1991-962X
DOI:	10.5194/gmd-17-5897-2024