The importance of choosing a proper validation strategy in predictive models. A tutorial with real examples

Bibliographic Details
Published in Analytica chimica acta Vol. 1275; p. 341532
Main Authors Lopez, Eneko; Etxebarria-Elezgarai, Jaione; Amigo, Jose Manuel; Seifert, Andreas
Format Journal Article
Language English
Published Netherlands Elsevier B.V. 22.09.2023
Summary: Machine learning is the art of combining a set of measurement data and predictive variables to forecast future events. Every day, new model approaches (with high levels of sophistication) can be found in the literature. However, less importance is given to the crucial stage of validation. Validation is the assessment that the model reliably links the measurements and the predictive variables. Yet a model can pass many forms of validation and cross-validation and still misrepresent the real nature of the data, so that it cannot be used to predict external samples. This manuscript shows in a didactic manner how important the data structure is when a model is constructed, and how easy it is to obtain models that look promising with poorly designed cross-validation and external validation strategies. A comprehensive overview of the main validation strategies is shown, exemplified by three different scenarios, all of them focused on classification.

Highlights:
• We highlight the importance of cross-validation and an external test set in prediction.
• Good model performance means having the best figures of merit in testing, not in training.
• Cross-validation in small datasets can deliver misleading models.
• Calibration and validation must consider the inner and hierarchical data structure.
• If sample independence is not guaranteed, perform several validation procedures.
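
The point about data structure can be illustrated with a minimal sketch. It is not taken from the article: the data are synthetic, the classifier is a plain 1-nearest-neighbour rule, and all names (`naive_folds`, `grouped_folds`, `cv_accuracy`, `centers`, `labels`) are hypothetical. Six samples are each measured three times, and the class labels are assigned arbitrarily. A cross-validation split that ignores which replicates belong to the same sample lets the classifier "recognise" each sample from its own replicates, while a group-aware split holds out whole samples.

```python
# Toy illustration (synthetic data, not from the article): replicate
# measurements of the same sample leak across folds in a naive split.

def naive_folds(n, k):
    """Interleaved k-fold: row i goes to fold i % k, ignoring groups."""
    for f in range(k):
        test = [i for i in range(n) if i % k == f]
        train = [i for i in range(n) if i % k != f]
        yield train, test

def grouped_folds(groups, k):
    """Group-aware k-fold: all rows of one group share the same fold."""
    unique = sorted(set(groups))
    fold_of = {g: j % k for j, g in enumerate(unique)}
    for f in range(k):
        test = [i for i, g in enumerate(groups) if fold_of[g] == f]
        train = [i for i, g in enumerate(groups) if fold_of[g] != f]
        yield train, test

def one_nn_predict(x_train, y_train, x):
    """Return the label of the closest training point (1-nearest neighbour)."""
    nearest = min(range(len(x_train)), key=lambda j: abs(x_train[j] - x))
    return y_train[nearest]

def cv_accuracy(X, y, folds):
    """Fraction of held-out rows classified correctly over all folds."""
    correct = 0
    total = 0
    for train, test in folds:
        x_tr = [X[i] for i in train]
        y_tr = [y[i] for i in train]
        for i in test:
            correct += one_nn_predict(x_tr, y_tr, X[i]) == y[i]
            total += 1
    return correct / total

# Six hypothetical samples, three replicate measurements each.
centers = [0.0, 10.0, 25.0, 45.0, 70.0, 100.0]  # one feature value per sample
labels = [0, 0, 1, 1, 0, 1]                     # arbitrarily assigned classes
groups = [g for g in range(6) for _ in range(3)]
X = [centers[g] + 0.1 * r for g in range(6) for r in range(3)]
y = [labels[g] for g in groups]

naive_acc = cv_accuracy(X, y, naive_folds(len(X), 3))
grouped_acc = cv_accuracy(X, y, grouped_folds(groups, 3))
print(f"naive CV accuracy:   {naive_acc:.2f}")
print(f"grouped CV accuracy: {grouped_acc:.2f}")
```

In this toy setup the naive split reports a much higher accuracy than the grouped split, even though nothing generalisable has been learned. In practice one would likely reach for an existing group-aware splitter, such as scikit-learn's `GroupKFold`, rather than hand-rolled folds.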
ISSN: 0003-2670, 1873-4324
DOI: 10.1016/j.aca.2023.341532