Evaluating Synthetic Data Augmentation to Correct for Data Imbalance in Realistic Clinical Prediction Settings

Bibliographic Details
Published in: Studies in health technology and informatics, Vol. 316, p. 929
Main Authors: Wahler, Nina; Kaabachi, Bayrem; Kulynych, Bogdan; Despraz, Jérémie; Simon, Christian; Raisaro, Jean Louis
Format: Journal Article
Language: English
Published: Netherlands, 22.08.2024

Abstract: Predictive modeling holds great potential for clinical decision-making, yet its effectiveness can be hindered by the data imbalances inherent in clinical datasets. This study investigates whether synthetic data can improve the performance of predictive models on realistic, small, imbalanced clinical datasets. We compared several synthetic data generation methods, including Generative Adversarial Networks, Normalizing Flows, and Variational Autoencoders, against the standard baselines for correcting class underrepresentation on four clinical datasets. Although results show improvement in F1 scores in some cases, even over multiple repetitions, we do not obtain statistically significant evidence that synthetic data generation outperforms standard baselines for correcting class imbalance. This study challenges common beliefs about the efficacy of synthetic data for data augmentation and highlights the importance of evaluating new, complex methods against simple baselines.
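
The abstract does not list which "standard baselines" were used. Purely as an illustration, the sketch below assumes random minority oversampling as one such simple baseline and shows how an F1 score is computed. The function names (`random_oversample`, `f1_score`) and the toy data are hypothetical and are not taken from the paper.

```python
import random


def random_oversample(rows, labels, minority=1, seed=0):
    """Duplicate randomly chosen minority rows until both classes are equal in size.

    Assumes binary labels and that `minority` is the smaller class.
    """
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    if not minority_idx:
        raise ValueError("no minority examples to resample")
    majority_count = len(labels) - len(minority_idx)
    # Number of extra minority copies needed to match the majority class.
    n_extra = majority_count - len(minority_idx)
    extra = [rng.choice(minority_idx) for _ in range(n_extra)]
    keep = list(range(len(labels))) + extra
    return [rows[i] for i in keep], [labels[i] for i in keep]


def f1_score(y_true, y_pred, positive=1):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when that denominator is zero."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy imbalanced data (hypothetical): 8 majority (0) and 2 minority (1) examples.
rows = [[float(i)] for i in range(10)]
labels = [0] * 8 + [1] * 2
balanced_rows, balanced_labels = random_oversample(rows, labels)
```

In a comparison of the kind the abstract describes, resampling or synthetic generation would normally be applied to the training split only, with F1 measured on an untouched test split and the whole procedure repeated over several seeds before testing for significance.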
Author Affiliation: Lausanne University Hospital (CHUV), Switzerland (all six authors)
EISSN: 1879-8365
Keywords: Minority Oversampling; Imbalanced Data; Synthetic Data
PMID: 39176944
Subject Terms: Clinical Decision-Making; Humans
URI: https://www.ncbi.nlm.nih.gov/pubmed/39176944