Evaluating Synthetic Data Augmentation to Correct for Data Imbalance in Realistic Clinical Prediction Settings

Bibliographic Details
Published in Studies in health technology and informatics Vol. 316; p. 929
Main Authors Wahler, Nina, Kaabachi, Bayrem, Kulynych, Bogdan, Despraz, Jérémie, Simon, Christian, Raisaro, Jean Louis
Format Journal Article
Language English
Published Netherlands 22.08.2024
Summary: Predictive modeling holds great potential for clinical decision-making, yet its effectiveness can be hindered by the data imbalances inherent in clinical datasets. This study investigates the utility of synthetic data for improving predictive performance on realistic, small, imbalanced clinical datasets. We compared several synthetic data generation methods, including Generative Adversarial Networks, Normalizing Flows, and Variational Autoencoders, against standard baselines for correcting class underrepresentation on four clinical datasets. Although results show improvement in F1 scores in some cases, even over multiple repetitions, we do not obtain statistically significant evidence that synthetic data generation outperforms standard baselines for correcting class imbalance. This study challenges common beliefs about the efficacy of synthetic data for data augmentation and highlights the importance of evaluating new, complex methods against simple baselines.
ISSN: 1879-8365
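
The summary contrasts synthetic data generators with "standard baselines" for class imbalance and reports results as F1 scores, but this record does not name the baselines used. As a rough illustration only, the sketch below assumes random oversampling as one such simple baseline and computes a binary F1 score by hand. The function names `random_oversample` and `f1_score` and the toy data are hypothetical and are not taken from the study.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of each smaller class until
    every class has as many rows as the largest class.
    (Assumed baseline; the study's actual baselines are not listed here.)"""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        # Indices of the original rows that belong to this class.
        idx = [i for i, lab in enumerate(y) if lab == label]
        for _ in range(target - n):
            j = rng.choice(idx)
            X_out.append(X[j])
            y_out.append(label)
    return X_out, y_out


def f1_score(y_true, y_pred, positive=1):
    """Binary F1 = 2*TP / (2*TP + FP + FN); returns 0.0 if undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical toy data: five majority-class rows, one minority-class row.
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.9]]
y = [0, 0, 0, 0, 0, 1]
X_bal, y_bal = random_oversample(X, y)
```

In a comparison of the kind the summary describes, oversampling (or synthetic generation) would be applied to the training split only, with F1 measured on untouched held-out data over repeated runs, so that any difference between methods can be tested for statistical significance.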