Data leakage inflates prediction performance in connectome-based machine learning models

Bibliographic Details
Published in	Nature Communications, Vol. 15, no. 1, article 1829 (15 pp.)
Main Authors	Rosenblatt, Matthew; Tejavibulya, Link; Jiang, Rongtao; Noble, Stephanie; Scheinost, Dustin
Format	Journal Article
Language	English
Published	London: Nature Publishing Group UK; 28.02.2024; Nature Publishing Group; Nature Portfolio
Subjects	59/36; 59/57; 631/378; 631/378/2649; 639/705/1042; Brain - diagnostic imaging; Connectome - methods; Data integrity; Datasets; Feature selection; Humanities and Social Sciences; Humans; Leakage; Learning algorithms; Machine Learning; Magnetic Resonance Imaging - methods; Medical imaging; Modelling; multidisciplinary; Neuroimaging; Neuroimaging - methods; Performance prediction; Phenotypes; Prediction models; Reproducibility of Results; Science; Science (multidisciplinary); Structure-function relationships

More Information
Summary:	Predictive modeling is a central technique in neuroimaging to identify brain-behavior relationships and test their generalizability to unseen data. However, data leakage undermines the validity of predictive models by breaching the separation between training and test data. Leakage is always an incorrect practice but is still pervasive in machine learning. Understanding its effects on neuroimaging predictive models can inform how leakage affects existing literature. Here, we investigate the effects of five forms of leakage (involving feature selection, covariate correction, and dependence between subjects) on functional and structural connectome-based machine learning models across four datasets and three phenotypes. Leakage via feature selection and repeated subjects drastically inflates prediction performance, whereas other forms of leakage have minor effects. Furthermore, small datasets exacerbate the effects of leakage. Overall, our results illustrate the variable effects of leakage and underscore the importance of avoiding data leakage to improve the validity and reproducibility of predictive modeling. Editor's summary: The effects of data leakage on predictive models in neuroimaging studies are not well understood. Here, the authors show that data leakage via feature selection and repeated subjects drastically inflates prediction performance, whereas other forms of leakage have more minor effects.
ISSN:	2041-1723
DOI:	10.1038/s41467-024-46150-w
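
Illustration: The Summary states that leakage via feature selection inflates prediction performance. The sketch below is one way to see that effect on synthetic data; it is not the authors' pipeline. It assumes pure-noise features and a pure-noise target, selects the k features most correlated with the target, and scores each held-out subject with a sign-weighted sum of those features. All names (`pearson`, `select_features`, `cv_correlation`, `simulate`) and all parameter values are hypothetical choices made for this example. The only difference between the two runs is whether features are selected on all subjects (leaky) or on the training folds only.

```python
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def select_features(X, y, rows, k):
    """Return (feature index, sign) for the k features whose correlation
    with y, computed only over `rows`, is largest in absolute value."""
    y_sub = [y[i] for i in rows]
    scored = []
    for j in range(len(X[0])):
        r = pearson([X[i][j] for i in rows], y_sub)
        scored.append((abs(r), j, 1.0 if r >= 0 else -1.0))
    scored.sort(reverse=True)
    return [(j, sign) for _, j, sign in scored[:k]]


def cv_correlation(X, y, n_folds, k, leaky):
    """K-fold cross-validation; returns the correlation between held-out
    predictions and the observed target.

    leaky=True  selects features using every subject, test fold included.
    leaky=False selects features using the training subjects only.
    """
    n = len(y)
    all_rows = list(range(n))
    folds = [list(range(f, n, n_folds)) for f in range(n_folds)]
    preds = [0.0] * n
    for test in folds:
        held_out = set(test)
        train = [i for i in all_rows if i not in held_out]
        feats = select_features(X, y, all_rows if leaky else train, k)
        for i in test:
            # Prediction: sign-weighted sum of the selected features.
            preds[i] = sum(sign * X[i][j] for j, sign in feats)
    return pearson(preds, y)


def simulate(seed=0, n_subjects=60, n_features=500, k=10, n_folds=5):
    """Generate noise data with no true signal and score both pipelines."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(n_features)]
         for _ in range(n_subjects)]
    y = [rng.gauss(0, 1) for _ in range(n_subjects)]
    leaky = cv_correlation(X, y, n_folds, k, leaky=True)
    clean = cv_correlation(X, y, n_folds, k, leaky=False)
    return leaky, clean


if __name__ == "__main__":
    leaky_r, clean_r = simulate()
    print(f"leaky feature selection:   r = {leaky_r:.2f}")
    print(f"fold-wise feature selection: r = {clean_r:.2f}")
```

Because the synthetic features carry no information about the target, any sizable positive correlation in the leaky run is an artifact of selecting features with the test subjects included. Reducing `n_subjects` in this toy setup is a simple way to probe the Summary's point that small datasets exacerbate the effect.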