Evaluating Synthetic Data Augmentation to Correct for Data Imbalance in Realistic Clinical Prediction Settings
Published in | Studies in health technology and informatics Vol. 316; p. 929 |
---|---|
Main Authors | Wahler, Nina; Kaabachi, Bayrem; Kulynych, Bogdan; Despraz, Jérémie; Simon, Christian; Raisaro, Jean Louis |
Format | Journal Article |
Language | English |
Published | Netherlands, 22.08.2024 |
Subjects | |
Online Access | Get more information |
Abstract | Predictive modeling holds a large potential in clinical decision-making, yet its effectiveness can be hindered by inherent data imbalances in clinical datasets. This study investigates the utility of synthetic data for improving the performance of predictive modeling on realistic small imbalanced clinical datasets. We compared various synthetic data generation methods including Generative Adversarial Networks, Normalizing Flows, and Variational Autoencoders to the standard baselines for correcting for class underrepresentation on four clinical datasets. Although results show improvement in F1 scores in some cases, even over multiple repetitions, we do not obtain statistically significant evidence that synthetic data generation outperforms standard baselines for correcting for class imbalance. This study challenges common beliefs about the efficacy of synthetic data for data augmentation and highlights the importance of evaluating new complex methods against simple baselines. |
---|---|
Author | Simon, Christian; Raisaro, Jean Louis; Kulynych, Bogdan; Kaabachi, Bayrem; Despraz, Jérémie; Wahler, Nina |
Author_xml | – sequence: 1 givenname: Nina surname: Wahler fullname: Wahler, Nina organization: Lausanne University Hospital (CHUV), Switzerland – sequence: 2 givenname: Bayrem surname: Kaabachi fullname: Kaabachi, Bayrem organization: Lausanne University Hospital (CHUV), Switzerland – sequence: 3 givenname: Bogdan surname: Kulynych fullname: Kulynych, Bogdan organization: Lausanne University Hospital (CHUV), Switzerland – sequence: 4 givenname: Jérémie surname: Despraz fullname: Despraz, Jérémie organization: Lausanne University Hospital (CHUV), Switzerland – sequence: 5 givenname: Christian surname: Simon fullname: Simon, Christian organization: Lausanne University Hospital (CHUV), Switzerland – sequence: 6 givenname: Jean Louis surname: Raisaro fullname: Raisaro, Jean Louis organization: Lausanne University Hospital (CHUV), Switzerland |
BackLink | https://www.ncbi.nlm.nih.gov/pubmed/39176944$$D View this record in MEDLINE/PubMed |
ContentType | Journal Article |
DBID | CGR CUY CVF ECM EIF NPM |
DatabaseName | Medline MEDLINE MEDLINE (Ovid) MEDLINE MEDLINE PubMed |
DatabaseTitle | MEDLINE MEDLINE with Full Text Medline Complete PubMed MEDLINE (Ovid) |
DatabaseTitleList | MEDLINE |
Database_xml | – sequence: 1 dbid: NPM name: PubMed url: https://proxy.k.utb.cz/login?url=http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed sourceTypes: Index Database – sequence: 2 dbid: EIF name: MEDLINE url: https://proxy.k.utb.cz/login?url=https://www.webofscience.com/wos/medline/basic-search sourceTypes: Index Database |
DeliveryMethod | no_fulltext_linktorsrc |
EISSN | 1879-8365 |
ExternalDocumentID | 39176944 |
Genre | Journal Article |
GroupedDBID | CGR CUY CVF ECM EIF NPM |
ID | FETCH-pubmed_primary_391769442 |
IngestDate | Sat Nov 02 12:23:28 EDT 2024 |
IsPeerReviewed | false |
IsScholarly | false |
Keywords | Minority Oversampling Imbalanced Data Synthetic Data |
Language | English |
LinkModel | OpenURL |
MergedId | FETCHMERGED-pubmed_primary_391769442 |
PMID | 39176944 |
ParticipantIDs | pubmed_primary_39176944 |
PublicationCentury | 2000 |
PublicationDate | 2024-Aug-22 |
PublicationDateYYYYMMDD | 2024-08-22 |
PublicationDate_xml | – month: 08 year: 2024 text: 2024-Aug-22 day: 22 |
PublicationDecade | 2020 |
PublicationPlace | Netherlands |
PublicationPlace_xml | – name: Netherlands |
PublicationTitle | Studies in health technology and informatics |
PublicationTitleAlternate | Stud Health Technol Inform |
PublicationYear | 2024 |
Score | 3.857985 |
SourceID | pubmed |
SourceType | Index Database |
StartPage | 929 |
SubjectTerms | Clinical Decision-Making Humans |
Title | Evaluating Synthetic Data Augmentation to Correct for Data Imbalance in Realistic Clinical Prediction Settings |
URI | https://www.ncbi.nlm.nih.gov/pubmed/39176944 |
Volume | 316 |
hasFullText | |
inHoldings | 1 |
isFullTextHit | |
isPrint | |
link.rule.ids | 783 |
linkProvider | Clarivate |
openUrl | ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&rft.genre=article&rft.atitle=Evaluating+Synthetic+Data+Augmentation+to+Correct+for+Data+Imbalance+in+Realistic+Clinical+Prediction+Settings&rft.jtitle=Studies+in+health+technology+and+informatics&rft.au=Wahler%2C+Nina&rft.au=Kaabachi%2C+Bayrem&rft.au=Kulynych%2C+Bogdan&rft.au=Despraz%2C+J%C3%A9r%C3%A9mie&rft.date=2024-08-22&rft.eissn=1879-8365&rft.volume=316&rft.spage=929&rft_id=info%3Apmid%2F39176944&rft_id=info%3Apmid%2F39176944&rft.externalDocID=39176944 |
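
The abstract in this record compares synthetic data generators against "standard baselines" for class underrepresentation and reports F1 scores. The record does not say which baselines or which implementation the authors used, so the following is only a minimal sketch of one common baseline, random minority oversampling, together with a binary F1 computation. The function names `random_oversample` and `f1_score` are hypothetical and written in plain Python for illustration; they are not taken from the paper.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every
    class has as many rows as the largest class (illustrative baseline)."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        # Rows of the original data that belong to this class.
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out


def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives, so F1 is defined as 0 here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (made up for illustration): four majority rows, one minority row.
X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = [0, 0, 0, 0, 1]
X_bal, y_bal = random_oversample(X, y)
```

In a comparison like the one the abstract describes, oversampling of this kind would be applied to the training split only, so that the F1 score on held-out data reflects the original class distribution.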