Comparative Methods for Addressing Imbalanced Datasets in Predicting Medical Appointment No-Shows

Bibliographic Details
Published in: 2024 L Latin American Computer Conference (CLEI), pp. 1-10
Main Authors: Lovatte, Marcelo Ardizzon; Resendo, Leandro Colombi; Komati, Karin Satie
Format: Conference Proceeding
Language: English
Published: IEEE, 12.08.2024
Summary: The efficiency and accessibility of healthcare delivery systems can be enhanced through solutions that minimize the impact of patient No-Shows for medical exams and appointments. The significant disparity between the "Show" and "No-Show" categories within the dataset can impair the efficacy of predictive models, necessitating the employment of dataset balancing techniques before classifier training. This study evaluated various methods to address imbalances in datasets for predicting patient No-Shows. Techniques such as undersampling (Random Removal - RR, Remove Similar - RS, Remove Farthest - RF) and oversampling (Adaptive Synthetic Sampling - ADASYN) were applied to adjust the balance between Show and No-Show categories to ratios of 80-20%, 70-30%, 60-40%, and 50-50%, alongside a cost-sensitive learning approach. Four classifiers were deployed: Support Vector Machine (SVM), Naive Bayes, k-Nearest Neighbors (k-NN), and Random Forests. Additionally, a decision tree produced by the C4.5 algorithm was utilized for the cost-sensitive learning approach. The classifiers were evaluated using metrics such as Precision, Recall, F-measure, and AUC/ROC. Among the various methods tested, the RR combined with the k-NN classifier achieved the highest AUC/ROC value. However, due to the longer computational time of k-NN, the Random Forest classifier emerged as a more pragmatic choice when processing time is a critical consideration.
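
As an illustration of the kind of pipeline the summary describes (not the authors' code), the sketch below uses imbalanced-learn's RandomUnderSampler as a stand-in for the Random Removal (RR) method and ADASYN for oversampling, rebalances a synthetic binary dataset to the four Show/No-Show ratios, and compares k-NN and Random Forest by AUC/ROC. A class-weighted decision tree (scikit-learn's CART rather than C4.5) stands in for the cost-sensitive approach; the dataset, feature count, and hyperparameters are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score
from imblearn.over_sampling import ADASYN
from imblearn.under_sampling import RandomUnderSampler

# Synthetic stand-in for the appointment data: class 1 = "No-Show" (minority, ~10%).
X, y = make_classification(n_samples=10_000, n_features=12, weights=[0.9, 0.1],
                           flip_y=0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# Show/No-Show ratios from the study, expressed as minority/majority for imblearn.
ratios = {"80-20": 0.25, "70-30": 0.43, "60-40": 0.67, "50-50": 1.0}

classifiers = {
    "k-NN": KNeighborsClassifier(n_neighbors=5),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

for label, ratio in ratios.items():
    samplers = (RandomUnderSampler(sampling_strategy=ratio, random_state=42),  # ~RR
                ADASYN(sampling_strategy=ratio, random_state=42))
    for sampler in samplers:
        # Rebalance the training split only; the test split keeps the real imbalance.
        X_bal, y_bal = sampler.fit_resample(X_tr, y_tr)
        for name, clf in classifiers.items():
            clf.fit(X_bal, y_bal)
            auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
            print(f"{label} {type(sampler).__name__:18s} {name:13s} AUC={auc:.3f}")

# Cost-sensitive baseline: class-weighted CART (stand-in for C4.5) on the raw data.
cs_tree = DecisionTreeClassifier(class_weight="balanced", random_state=42)
cs_tree.fit(X_tr, y_tr)
print("cost-sensitive tree AUC =",
      round(roc_auc_score(y_te, cs_tree.predict_proba(X_te)[:, 1]), 3))
```

Holding the test split at its natural imbalance, as above, keeps the AUC/ROC comparison across balancing strategies honest; only the training data is resampled.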
ISSN: 2771-5752
DOI: 10.1109/CLEI64178.2024.10700560