Synthetic Data Generation in Small Datasets to Improve Classification Performance for Chronic Heart Failure Prediction
Published in | Computing in cardiology Vol. 50; pp. 1 - 4 |
---|---|
Main Authors | , |
Format | Conference Proceeding |
Language | English |
Published | CinC 01.10.2023 |
Summary: | In cardiology, we frequently develop machine learning models to predict events such as heart failure. These events often occur at low incidence in the available data, especially for under-represented subpopulations, and the resulting class imbalance limits classifier performance. To mitigate this, we investigate synthetic data generation, that is, algorithms trained to produce realistic patient data. In particular, we use synthetic data to augment the training data for CatBoost in classifying chronic heart failure on the University of California, Irvine myocardial infarction complications dataset (n = 1,700). Our primary metrics of interest are the mean and the variability of AUC and F1 score across five-fold cross-validation. Overall, we find modest gains in performance over the baseline classifier trained without augmented data. However, the more sophisticated generators, both with and without hyperparameter tuning, did not confer better performance than simpler methods. Furthermore, all methods showed large variability in classification metrics across folds. While synthetic data generation is a promising tool for addressing class imbalance, more investigation is needed to find sample sizes and settings that yield stable results. |
---|---|
ISSN: | 2325-887X |
DOI: | 10.22489/CinC.2023.081 |
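
The evaluation protocol described in the summary (augment the training data, then report the mean and variability of AUC and F1 across five folds) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' pipeline: scikit-learn's `GradientBoostingClassifier` stands in for CatBoost, the data are a toy imbalanced set from `make_classification` rather than the UCI dataset, and the hypothetical helper `interpolate_minority` is a deliberately simple generator (interpolation between random pairs of minority rows), not one of the generators evaluated in the paper. The point it shows is that synthetic rows are added to the training folds only, so each held-out fold contains real data exclusively.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold


def interpolate_minority(X_min, n_new, rng):
    """Hypothetical generator: interpolate between random pairs of minority rows."""
    i = rng.integers(0, len(X_min), size=n_new)
    j = rng.integers(0, len(X_min), size=n_new)
    t = rng.random((n_new, 1))  # interpolation weight in [0, 1)
    return X_min[i] + t * (X_min[j] - X_min[i])


def cross_validate(X, y, augment, seed=0):
    """Return per-fold AUC and F1 arrays from five-fold stratified cross-validation."""
    rng = np.random.default_rng(seed)
    folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    aucs, f1s = [], []
    for train_idx, test_idx in folds.split(X, y):
        X_tr, y_tr = X[train_idx], y[train_idx]
        if augment:
            # Add synthetic positives to the training fold only, up to class balance.
            n_new = max(int((y_tr == 0).sum() - (y_tr == 1).sum()), 0)
            X_syn = interpolate_minority(X_tr[y_tr == 1], n_new, rng)
            X_tr = np.vstack([X_tr, X_syn])
            y_tr = np.concatenate([y_tr, np.ones(n_new, dtype=y_tr.dtype)])
        # Stand-in classifier (assumption); the paper uses CatBoost.
        model = GradientBoostingClassifier(random_state=seed).fit(X_tr, y_tr)
        prob = model.predict_proba(X[test_idx])[:, 1]
        aucs.append(roc_auc_score(y[test_idx], prob))
        f1s.append(f1_score(y[test_idx], (prob >= 0.5).astype(int)))
    return np.array(aucs), np.array(f1s)


# Toy imbalanced data (assumption): roughly 10% positives.
X, y = make_classification(n_samples=600, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
for name, flag in [("baseline", False), ("augmented", True)]:
    aucs, f1s = cross_validate(X, y, augment=flag)
    print(f"{name}: AUC {aucs.mean():.3f} +/- {aucs.std(ddof=1):.3f}, "
          f"F1 {f1s.mean():.3f} +/- {f1s.std(ddof=1):.3f}")
```

Reporting the standard deviation next to the mean makes the fold-to-fold spread visible. With only a few positive cases per held-out fold, that spread can be comparable in size to any gain from augmentation.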