The effects of data leakage on connectome-based machine learning models
Published in | bioRxiv |
---|---|
Main Authors | Rosenblatt, Matthew; Tejavibulya, Link; Jiang, Rongtao; Noble, Stephanie; Scheinost, Dustin |
Format | Journal Article Paper |
Language | English |
Published | United States: Cold Spring Harbor Laboratory Press, 28.12.2023; Cold Spring Harbor Laboratory |
Edition | 1.2 |
ISSN | 2692-8205 |
DOI | 10.1101/2023.06.09.544383 |
Abstract | Predictive modeling has now become a central technique in neuroimaging to identify complex brain-behavior relationships and test their generalizability to unseen data. However, data leakage, which unintentionally breaches the separation between data used to train and test the model, undermines the validity of predictive models. Previous literature suggests that leakage is generally pervasive in machine learning, but few studies have empirically evaluated the effects of leakage in neuroimaging data. Although leakage is always an incorrect practice, understanding the effects of leakage on neuroimaging predictive models provides insight into the extent to which leakage may affect the literature. Here, we investigated the effects of leakage on machine learning models in two common neuroimaging modalities, functional and structural connectomes. Using over 400 different pipelines spanning four large datasets and three phenotypes, we evaluated five forms of leakage fitting into three broad categories: feature selection, covariate correction, and lack of independence between subjects. As expected, leakage via feature selection and repeated subjects drastically inflated prediction performance. Notably, other forms of leakage had only minor effects (e.g., leaky site correction) or even decreased prediction performance (e.g., leaky covariate regression). In some cases, leakage affected not only prediction performance, but also model coefficients, and thus neurobiological interpretations. Finally, we found that predictive models using small datasets were more sensitive to leakage. Overall, our results illustrate the variable effects of leakage on prediction pipelines and underscore the importance of avoiding data leakage to improve the validity and reproducibility of predictive modeling. |
---|---|
Footnotes | The discussion section was updated with more potential solutions; the methods section was updated to describe in more detail why we selected the particular types of leakage in this work; a new section analyzing family leakage was added. |
Author | Tejavibulya, Link; Jiang, Rongtao; Rosenblatt, Matthew; Noble, Stephanie; Scheinost, Dustin |
Author_xml | 1. Rosenblatt, Matthew (ORCID 0000-0002-3894-6198), Department of Biomedical Engineering, Yale University, New Haven, CT; 2. Tejavibulya, Link, Interdepartmental Neuroscience Program, Yale University, New Haven, CT; 3. Jiang, Rongtao, Department of Radiology & Biomedical Imaging, Yale School of Medicine, New Haven, CT; 4. Noble, Stephanie, Department of Psychology, Northeastern University, Boston, MA; 5. Scheinost, Dustin, Department of Statistics & Data Science, Yale University, New Haven, CT |
BackLink | https://www.ncbi.nlm.nih.gov/pubmed/38234740 (MEDLINE/PubMed record) |
Cites_doi | 10.1007/978-3-030-67670-4_1 10.1101/2022.12.31.522374 10.1016/j.bpsc.2022.12.006 10.1145/3357713.3384290 |
ContentType | Journal Article Paper |
Copyright | 2023. This article is published under http://creativecommons.org/licenses/by-nd/4.0/ (“the License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License. 2023, Posted by Cold Spring Harbor Laboratory |
Discipline | Biology |
EISSN | 2692-8205 |
ExternalDocumentID | 2023.06.09.544383v2 38234740 |
Genre | Preprint Working Paper/Pre-Print |
GrantInformation_xml | – fundername: NIDA NIH HHS grantid: U01 DA041134 – fundername: NIMH NIH HHS grantid: RC2 MH089924 – fundername: NIDA NIH HHS grantid: U01 DA041028 – fundername: NIDA NIH HHS grantid: U01 DA051016 – fundername: NIDA NIH HHS grantid: U24 DA041123 – fundername: NIDA NIH HHS grantid: U01 DA051039 – fundername: NIDA NIH HHS grantid: U01 DA041048 – fundername: NIDA NIH HHS grantid: U24 DA041147 – fundername: NIDA NIH HHS grantid: U01 DA050989 – fundername: NIMH NIH HHS grantid: R01 MH121095 |
IsDoiOpenAccess | true |
IsOpenAccess | true |
IsPeerReviewed | false |
IsScholarly | false |
License | This pre-print is available under a Creative Commons License (Attribution-NoDerivs 4.0 International), CC BY-ND 4.0, as described at http://creativecommons.org/licenses/by-nd/4.0 |
Notes | Competing Interest Statement: The authors have declared no competing interest. |
ORCID | 0000-0002-3894-6198 |
OpenAccessLink | https://www.biorxiv.org/content/10.1101/2023.06.09.544383 |
PMID | 38234740 |
PQID | 2907021701 |
PQPubID | 2050091 |
PageCount | 46 |
PublicationDate | 2023-12-28 |
PublicationPlace | United States; Cold Spring Harbor |
PublicationTitle | bioRxiv |
Publisher | Cold Spring Harbor Laboratory Press; Cold Spring Harbor Laboratory |
References | 38418819 - Nat Commun. 2024 Feb 28;15(1):1829 |
SecondaryResourceType | preprint |
SourceID | biorxiv; proquest; pubmed |
SourceType | Open Access Repository; Aggregation Database; Index Database |
SubjectTerms | Data integrity; Feature selection; Leakage; Learning algorithms; Machine learning; Medical imaging; Neuroimaging; Neuroscience; Phenotypes; Prediction models; Structure-function relationships |
Title | The effects of data leakage on connectome-based machine learning models |
URI | https://www.ncbi.nlm.nih.gov/pubmed/38234740 https://www.proquest.com/docview/2907021701 https://www.proquest.com/docview/2916408044 https://www.biorxiv.org/content/10.1101/2023.06.09.544383 |
linkProvider | Cold Spring Harbor Laboratory Press |
openUrl | ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&rft.genre=article&rft.atitle=The+effects+of+data+leakage+on+connectome-based+machine+learning+models&rft.jtitle=bioRxiv&rft.au=Rosenblatt%2C+Matthew&rft.au=Tejavibulya%2C+Link&rft.au=Jiang%2C+Rongtao&rft.au=Noble%2C+Stephanie&rft.date=2023-12-28&rft.issn=2692-8205&rft.eissn=2692-8205&rft_id=info:doi/10.1101%2F2023.06.09.544383&rft.externalDBID=NO_FULL_TEXT |