Rip van Winkle's Razor: A Simple Estimate of Overfit to Test Data
Main Authors | , |
---|---|
Format | Journal Article |
Language | English |
Published | 25.02.2021 |
Subjects | |
Summary: | Traditional statistics forbids use of test data (a.k.a. holdout data) during
training. Dwork et al. 2015 pointed out that current practices in machine
learning, whereby researchers build upon each other's models, copying
hyperparameters and even computer code, amount to implicitly training on the
test set. Thus the error rate on test data may not reflect the true population
error. This observation initiated adaptive data analysis, which provides
evaluation mechanisms with guaranteed upper bounds on this difference. With
statistical query (i.e. test accuracy) feedback, the best upper bound is
fairly pessimistic: the deviation can hit a practically vacuous value if the
number of models tested is quadratic in the size of the test set.
In this work, we present a simple new estimate, Rip van Winkle's Razor.
It relies upon a new notion of "information content" of a model: the amount
of information that would have to be provided to an expert referee who is
intimately familiar with the field and relevant science/math, and who has
just been woken up after falling asleep at the moment of the creation of the
test data (like "Rip van Winkle" of the famous fairy tale).
This notion of information content is used to provide an estimate of the above
deviation which is shown to be non-vacuous in many modern settings. |
---|---|
DOI: | 10.48550/arxiv.2102.13189 |
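
The following is a minimal sketch of how a description length could translate into a bound on the gap between test error and population error. It is not taken from the paper: it assumes the textbook Occam-style argument (a one-sided Hoeffding inequality plus a union bound over the at most 2^k models describable in k bits), and the paper's own estimate may differ in constants and form. The function name `occam_deviation_bound` and the numbers in the usage lines are hypothetical.

```python
import math


def occam_deviation_bound(description_bits: float, test_size: int,
                          delta: float = 0.05) -> float:
    """Upper bound on (population error - test error), holding with
    probability at least 1 - delta for every model describable in
    `description_bits` bits.

    Assumption (not from the paper): one-sided Hoeffding inequality
    combined with a union bound over at most 2**description_bits models,
    which gives sqrt((k * ln 2 + ln(1 / delta)) / (2 * N)).
    """
    if test_size <= 0:
        raise ValueError("test_size must be positive")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    if description_bits < 0:
        raise ValueError("description_bits must be non-negative")
    log_num_models = description_bits * math.log(2)
    return math.sqrt((log_num_models + math.log(1.0 / delta)) / (2 * test_size))


# Hypothetical usage: a 500-bit description and a 50,000-example test set.
bound = occam_deviation_bound(description_bits=500, test_size=50_000)
print(f"deviation bound: {bound:.4f}")
```

Under these assumptions the bound grows with the square root of the description length, so a model that can be described tersely to a well-informed referee gets a small, non-vacuous gap, while a bound that reaches 1 or more says nothing.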