Synthetic Data Generation in Small Datasets to Improve Classification Performance for Chronic Heart Failure Prediction

Bibliographic Details
Published in	Computing in Cardiology, Vol. 50, pp. 1-4
Main Authors	Zawadzki, Roy S; Parvaneh, Saman
Format	Conference Proceeding
Language	English
Published	CinC 01.10.2023
Subjects	Cardiovascular diseases; Measurement; Myocardium; Prediction algorithms; Predictive models; Stability analysis; Training data

More Information
Summary:	In cardiology, we frequently develop machine learning models to predict events such as heart failure. These events often occur at low incidence in the available data, especially for under-represented subpopulations, and the resulting class imbalance limits classifier performance. To mitigate this, we investigate synthetic data generation, that is, algorithms trained to generate realistic patient data. In particular, we use synthetic data to augment the training data for CatBoost when classifying chronic heart failure on the University of California, Irvine (UCI) myocardial infarction complications dataset (n = 1,700). Our primary metrics of interest are the mean and the variability of AUC and F1 score across five-fold cross-validation. Overall, we find modest gains in performance over the baseline classifier trained without augmented data. However, the more sophisticated generators, both with and without hyperparameter tuning, did not perform better than simpler methods. Furthermore, all methods showed large variability in classification metrics across folds. While synthetic data generation is a promising tool for addressing class imbalance, more investigation is needed to find the sample sizes and settings that yield stable results.
ISSN:	2325-887X
DOI:	10.22489/CinC.2023.081
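
Illustrative sketch: the evaluation loop described in the summary (augment the training folds with synthetic minority-class rows, then report the mean and spread of AUC and F1 score over five folds) might look roughly like the following. This is not the authors' code. Everything here is an assumption made for illustration: the toy data stands in for the UCI dataset, a nearest-centroid scorer stands in for CatBoost, and jittered resampling of minority rows stands in for a synthetic data generator. All function names are hypothetical.

```python
import random
import statistics

def auc(scores, labels):
    """Rank-based AUC: chance that a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def f1(preds, labels):
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def stratified_folds(labels, k, rng):
    """Assign each index to one of k folds, keeping class ratios similar."""
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds

def augment_minority(X, y, rng, noise=0.1):
    """Stand-in generator (assumption): jittered copies of minority rows until classes balance."""
    minority = [x for x, label in zip(X, y) if label == 1]
    n_new = max(0, (len(y) - len(minority)) - len(minority))
    synth = [[v + rng.gauss(0, noise) for v in rng.choice(minority)]
             for _ in range(n_new)]
    return X + synth, y + [1] * n_new

def centroid(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def fit_predict(X_tr, y_tr, X_te):
    """Stand-in classifier (assumption): positive score means closer to the positive centroid."""
    c1 = centroid([x for x, y in zip(X_tr, y_tr) if y == 1])
    c0 = centroid([x for x, y in zip(X_tr, y_tr) if y == 0])
    return [sq_dist(x, c0) - sq_dist(x, c1) for x in X_te]

def cross_validate(X, y, k=5, augment=False, seed=0):
    rng = random.Random(seed)
    aucs, f1s = [], []
    for test_idx in stratified_folds(y, k, rng):
        held_out = set(test_idx)
        X_tr = [X[i] for i in range(len(y)) if i not in held_out]
        y_tr = [y[i] for i in range(len(y)) if i not in held_out]
        if augment:  # augment the training folds only; the test fold stays real
            X_tr, y_tr = augment_minority(X_tr, y_tr, rng)
        scores = fit_predict(X_tr, y_tr, [X[i] for i in test_idx])
        y_te = [y[i] for i in test_idx]
        aucs.append(auc(scores, y_te))
        f1s.append(f1([int(s > 0) for s in scores], y_te))
    return {"auc_mean": statistics.mean(aucs), "auc_sd": statistics.stdev(aucs),
            "f1_mean": statistics.mean(f1s), "f1_sd": statistics.stdev(f1s)}

# Toy imbalanced data (roughly 10% positives), purely illustrative.
data_rng = random.Random(42)
y = [1 if data_rng.random() < 0.1 else 0 for _ in range(500)]
X = [[data_rng.gauss(0.8 * label, 1.0) for _ in range(4)] for label in y]

baseline = cross_validate(X, y, augment=False)
augmented = cross_validate(X, y, augment=True)
print("baseline ", baseline)
print("augmented", augmented)
```

Both runs use the same seed, so the fold assignment is identical and the comparison isolates the effect of augmentation. Synthetic rows are added only to the training folds, so every metric is computed on real held-out rows. The per-fold standard deviation is reported next to the mean because the summary treats variability across folds as a primary metric.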