The effects of data leakage on connectome-based machine learning models

Bibliographic Details
Published in bioRxiv
Main Authors Rosenblatt, Matthew; Tejavibulya, Link; Jiang, Rongtao; Noble, Stephanie; Scheinost, Dustin
Format Journal Article Paper
Language English
Published United States Cold Spring Harbor Laboratory Press 28.12.2023
Cold Spring Harbor Laboratory
Edition 1.2
ISSN 2692-8205
DOI 10.1101/2023.06.09.544383

Abstract Predictive modeling has now become a central technique in neuroimaging to identify complex brain-behavior relationships and test their generalizability to unseen data. However, data leakage, which unintentionally breaches the separation between data used to train and test the model, undermines the validity of predictive models. Previous literature suggests that leakage is generally pervasive in machine learning, but few studies have empirically evaluated the effects of leakage in neuroimaging data. Although leakage is always an incorrect practice, understanding the effects of leakage on neuroimaging predictive models provides insight into the extent to which leakage may affect the literature. Here, we investigated the effects of leakage on machine learning models in two common neuroimaging modalities, functional and structural connectomes. Using over 400 different pipelines spanning four large datasets and three phenotypes, we evaluated five forms of leakage fitting into three broad categories: feature selection, covariate correction, and lack of independence between subjects. As expected, leakage via feature selection and repeated subjects drastically inflated prediction performance. Notably, other forms of leakage had only minor effects (e.g., leaky site correction) or even decreased prediction performance (e.g., leaky covariate regression). In some cases, leakage affected not only prediction performance, but also model coefficients, and thus neurobiological interpretations. Finally, we found that predictive models using small datasets were more sensitive to leakage. Overall, our results illustrate the variable effects of leakage on prediction pipelines and underscore the importance of avoiding data leakage to improve the validity and reproducibility of predictive modeling.
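As an illustration of the feature-selection leakage named in the abstract, the following is a minimal sketch in plain Python. It is not the authors' pipeline. It assumes synthetic Gaussian noise for both the features and the phenotype, so any apparent predictive performance is spurious. The helper names (`pearson`, `select_features`, `cross_validated_r`), the sample size, the feature count, the fold assignment, and the sum-score model are all hypothetical choices made for this example.

```python
import random


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5


def select_features(X, y, rows, k):
    """Pick the k features most correlated with y, using only the given rows.

    Returns (feature index, sign of correlation) pairs.
    """
    y_sub = [y[i] for i in rows]
    scored = []
    for j in range(len(X[0])):
        r = pearson([X[i][j] for i in rows], y_sub)
        scored.append((abs(r), j, 1.0 if r >= 0 else -1.0))
    scored.sort(reverse=True)
    return [(j, sign) for _, j, sign in scored[:k]]


def cross_validated_r(X, y, n_folds, k, leaky):
    """Correlation between held-out sum-score predictions and y.

    leaky=True selects features on ALL subjects (test subjects included);
    leaky=False selects features on the training subjects of each fold only.
    """
    n = len(y)
    all_rows = list(range(n))
    preds = [0.0] * n
    for fold in range(n_folds):
        test = [i for i in all_rows if i % n_folds == fold]
        train = [i for i in all_rows if i % n_folds != fold]
        rows_for_selection = all_rows if leaky else train
        feats = select_features(X, y, rows_for_selection, k)
        for i in test:
            # Sum-score model: signed sum of the selected features.
            preds[i] = sum(sign * X[i][j] for j, sign in feats)
    return pearson(preds, y)


# Assumed toy setting: 40 subjects, 500 pure-noise features, noise phenotype.
rng = random.Random(0)
n_subjects, n_features = 40, 500
X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_subjects)]
y = [rng.gauss(0, 1) for _ in range(n_subjects)]

leaky_r = cross_validated_r(X, y, n_folds=5, k=10, leaky=True)
proper_r = cross_validated_r(X, y, n_folds=5, k=10, leaky=False)
print(f"leaky feature selection:     r = {leaky_r:.2f}")
print(f"selection inside each fold:  r = {proper_r:.2f}")
```

Because the data are pure noise, the estimate with selection inside each fold should stay near zero, while selecting features on the full sample before splitting typically produces a large positive correlation. The exact values depend on the random seed and on the assumed sample and feature counts.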
Competing Interest Statement: The authors have declared no competing interest.
Footnotes: The discussion section was updated with more potential solutions; the methods section was updated to describe in more detail why we selected the particular types of leakage in this work; a new section analyzing family leakage was added.
Author_xml – sequence: 1
  givenname: Matthew
  orcidid: 0000-0002-3894-6198
  surname: Rosenblatt
  fullname: Rosenblatt, Matthew
  organization: Department of Biomedical Engineering, Yale University, New Haven, CT
– sequence: 2
  givenname: Link
  surname: Tejavibulya
  fullname: Tejavibulya, Link
  organization: Interdepartmental Neuroscience Program, Yale University, New Haven, CT
– sequence: 3
  givenname: Rongtao
  surname: Jiang
  fullname: Jiang, Rongtao
  organization: Department of Radiology & Biomedical Imaging, Yale School of Medicine, New Haven, CT
– sequence: 4
  givenname: Stephanie
  surname: Noble
  fullname: Noble, Stephanie
  organization: Department of Psychology, Northeastern University, Boston, MA
– sequence: 5
  givenname: Dustin
  surname: Scheinost
  fullname: Scheinost, Dustin
  organization: Department of Statistics & Data Science, Yale University, New Haven, CT
BackLink https://www.ncbi.nlm.nih.gov/pubmed/38234740 (View this record in MEDLINE/PubMed)
Cites_doi 10.1007/978-3-030-67670-4_1
10.1101/2022.12.31.522374
10.1016/j.bpsc.2022.12.006
10.1145/3357713.3384290
ContentType Journal Article
Paper
Copyright 2023. This article is published under http://creativecommons.org/licenses/by-nd/4.0/ (“the License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
2023, Posted by Cold Spring Harbor Laboratory
Discipline Biology
EISSN 2692-8205
Edition 1.2
ExternalDocumentID 2023.06.09.544383v2
38234740
Genre Preprint
Working Paper/Pre-Print
GrantInformation_xml – fundername: NIDA NIH HHS
  grantid: U01 DA041134
– fundername: NIMH NIH HHS
  grantid: RC2 MH089924
– fundername: NIDA NIH HHS
  grantid: U01 DA041028
– fundername: NIDA NIH HHS
  grantid: U01 DA051016
– fundername: NIDA NIH HHS
  grantid: U24 DA041123
– fundername: NIDA NIH HHS
  grantid: U01 DA051039
– fundername: NIDA NIH HHS
  grantid: U01 DA041048
– fundername: NIDA NIH HHS
  grantid: U24 DA041147
– fundername: NIDA NIH HHS
  grantid: U01 DA050989
– fundername: NIMH NIH HHS
  grantid: R01 MH121095
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed false
IsScholarly false
Language English
License This pre-print is available under a Creative Commons License (Attribution-NoDerivs 4.0 International), CC BY-ND 4.0, as described at http://creativecommons.org/licenses/by-nd/4.0
ORCID 0000-0002-3894-6198
OpenAccessLink https://www.biorxiv.org/content/10.1101/2023.06.09.544383
PMID 38234740
PQID 2907021701
PQPubID 2050091
PageCount 46
References 38418819 - Nat Commun. 2024 Feb 28;15(1):1829
  year: 2021
  end-page: 128
  ident: 2023.06.09.544383v2.15
  article-title: Prediction, Not Association, Paves the Road to Precision Medicine
  publication-title: JAMA Psychiatry
– volume: 184
  start-page: 741
  year: 2019
  end-page: 760
  ident: 2023.06.09.544383v2.38
  article-title: How to control for confounds in decoding analyses of neuroimaging data
  publication-title: Neuroimage
– volume: 82
  start-page: 403
  year: 2013
  end-page: 415
  ident: 2023.06.09.544383v2.55
  article-title: Groupwise whole-brain parcellation from resting-state fMRI data for network node identification
  publication-title: Neuroimage
– volume: 27
  start-page: 3129
  year: 2022
  end-page: 3137
  ident: 2023.06.09.544383v2.1
  article-title: Predicting the future of neuroimaging predictive models in mental health
  publication-title: Mol. Psychiatry
SubjectTerms Data integrity
Feature selection
Leakage
Learning algorithms
Machine learning
Medical imaging
Neuroimaging
Neuroscience
Phenotypes
Prediction models
Structure-function relationships
Title The effects of data leakage on connectome-based machine learning models
URI https://www.ncbi.nlm.nih.gov/pubmed/38234740
https://www.proquest.com/docview/2907021701
https://www.proquest.com/docview/2916408044
https://www.biorxiv.org/content/10.1101/2023.06.09.544383
link http://utb.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMwjV3dS8MwEA-6Ifjmt9M5Ivja0aZZk7wqm0PYGOJgbyVpEhlqO_Yh-t9719bhg4IvfehXyt0ld9f75X6E3DBrQ6VhpnmOW3KE8YHKEhnEsdYsMQZ8Eu5GHo2T4ZQ_zHqzH1RfCKs082L5MX8v6_gI2IbVt5rcYYS5elw23FRd7Nwm413SBJNiyNowmHW3v1eYAj8leF3H_PVJiHjrkf6OLksvMzggzYleuOUh2XH5EdmraCI_j8k96JLWwAtaeIqoTvrq9AusBbTIaYZglWxdvLkAnZKlbyVC0tGaEuKZlnw3qxMyHfSf7oZBTYAQGAZxUpBwbxPfs1q4WCtlI8elt5g06ERqHZqEgQKUAJ_stHI-4kJYH2dceaO91vEpaeRF7s4JDaWXGfM8stLCnE1AdZFDthcnuM8Ya5HrWhjpompzkaLAUkS9qbQSWIu0v8WU1pa-Shlk15jXhBG8YnsZbBQLDzp3xQbvgaQMQlPOW-SsEu92FKxDcsHDi398wCXZx3OIJmGyTRrr5cZdQUywNh3SvO2PJ4-d0grgOJ6MvgCCsbKO