A Call to Reflect on Evaluation Practices for Failure Detection in Image Classification
ICLR 2023 (oral) Reliable application of machine learning-based decision systems in the wild is one of the major challenges currently investigated by the field. A large portion of established approaches aims to detect erroneous predictions by means of assigning confidence scores. This confidence may...
Saved in:
Main Authors | , , , |
---|---|
Format | Journal Article |
Language | English |
Published |
28.11.2022
|
Subjects | |
Online Access | Get full text |
Cover
Loading…
Summary: | ICLR 2023 (oral) Reliable application of machine learning-based decision systems in the wild
is one of the major challenges currently investigated by the field. A large
portion of established approaches aims to detect erroneous predictions by means
of assigning confidence scores. This confidence may be obtained by either
quantifying the model's predictive uncertainty, learning explicit scoring
functions, or assessing whether the input is in line with the training
distribution. Curiously, while these approaches all state to address the same
eventual goal of detecting failures of a classifier upon real-life application,
they currently constitute largely separated research fields with individual
evaluation protocols, which either exclude a substantial part of relevant
methods or ignore large parts of relevant failure sources. In this work, we
systematically reveal current pitfalls caused by these inconsistencies and
derive requirements for a holistic and realistic evaluation of failure
detection. To demonstrate the relevance of this unified perspective, we present
a large-scale empirical study for the first time enabling benchmarking
confidence scoring functions w.r.t all relevant methods and failure sources.
The revelation of a simple softmax response baseline as the overall best
performing method underlines the drastic shortcomings of current evaluation in
the abundance of publicized research on confidence scoring. Code and trained
models are at https://github.com/IML-DKFZ/fd-shifts. |
---|---|
DOI: | 10.48550/arxiv.2211.15259 |