In Defense of Metrics: Metrics Sufficiently Encode Typical Human Preferences Regarding Hydrological Model Performance

Bibliographic Details
Published in Water resources research Vol. 59; no. 6; p. e2022WR033918
Main Authors Gauch, Martin; Kratzert, Frederik; Gilon, Oren; Gupta, Hoshin; Mai, Juliane; Nearing, Grey; Tolson, Bryan; Hochreiter, Sepp; Klotz, Daniel
Format Journal Article
Language English
Published United States 01.06.2023
Summary: Building accurate rainfall–runoff models is an integral part of hydrological science and practice. The variety of modeling goals and applications have led to a large suite of evaluation metrics for these models. Yet, hydrologists still put considerable trust into visual judgment, although it is unclear whether such judgment agrees or disagrees with existing quantitative metrics. In this study, we tasked 622 experts to compare and judge more than 14,000 pairs of hydrographs from 13 different models. Our results show that expert opinion broadly agrees with quantitative metrics and results in a clear preference for a Machine Learning model over traditional hydrological models. The expert opinions are, however, subject to significant amounts of inconsistency. Nevertheless, where experts agree, we can predict their opinion purely from quantitative metrics, which indicates that the metrics sufficiently encode human preferences in a small set of numbers. While there remains room for improvement of quantitative metrics, we suggest that the hydrologic community should reinforce their benchmarking efforts and put more trust in these metrics.

Key Points:
A group of 622 participants visually judge model simulations similarly to quantitative metrics and considers a Machine Learning model best
Nash–Sutcliffe efficiency and Kling–Gupta efficiency are good predictors of overall and high‐flow hydrograph quality but low‐flow metrics are poor predictors of low‐flow quality
We can discriminate hydrographs that experts consistently consider good or bad purely based on existing quantitative metrics
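
Note on the metrics named in the summary: the two efficiency scores can be illustrated with a minimal sketch. It assumes the standard textbook definitions (Nash–Sutcliffe efficiency, and the 2009 form of the Kling–Gupta efficiency built from correlation, variability ratio, and bias ratio). The record does not state which variants the study used, so the function names `nse` and `kge` and the sample values are illustrative assumptions only.

```python
import math
from statistics import mean, pstdev


def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 minus squared error over variance of obs."""
    obs_mean = mean(obs)
    squared_error = sum((s - o) ** 2 for o, s in zip(obs, sim))
    obs_spread = sum((o - obs_mean) ** 2 for o in obs)
    return 1.0 - squared_error / obs_spread


def kge(obs, sim):
    """Kling-Gupta efficiency (assumed 2009 form).

    Combines linear correlation r, variability ratio alpha,
    and bias ratio beta; undefined if either series is constant.
    """
    obs_mean, sim_mean = mean(obs), mean(sim)
    obs_std, sim_std = pstdev(obs), pstdev(sim)
    cov = sum((o - obs_mean) * (s - sim_mean) for o, s in zip(obs, sim)) / len(obs)
    r = cov / (obs_std * sim_std)   # Pearson correlation
    alpha = sim_std / obs_std       # variability ratio
    beta = sim_mean / obs_mean      # bias ratio
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)


# Hypothetical discharge values, for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
simulated = [1.1, 1.9, 3.2, 3.8]
print(nse(observed, simulated), kge(observed, simulated))
```

Both scores equal 1 for a perfect simulation. An NSE of 0 means the simulation is no better than predicting the observed mean.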
Bibliography: Martin Gauch partly worked as an intern at Google Research.
ISSN: 0043-1397, 1944-7973
DOI: 10.1029/2022WR033918