Comparison of linear, penalized linear and machine learning models predicting hospital visit costs from chronic disease in Thailand

Bibliographic Details
Published in Informatics in Medicine Unlocked, Vol. 26, p. 100769
Main Authors Thongpeth, Wichayaporn; Lim, Apiradee; Wongpairin, Akemat; Thongpeth, Thaworn; Chaimontree, Santhana
Format Journal Article
Language English
Published Elsevier Ltd 2021

Summary: Health care costs from chronic diseases are generally positively skewed, which causes problems for traditional statistical models. Machine learning is a commonly used approach that produces accurate predictions with large sample sizes. However, evidence comparing the performance of statistical methods and machine learning on such data remains scattered. This study aimed to compare the prediction performance of linear, penalized linear and machine learning models for hospital visit costs from chronic disease in Thailand. A total of 18,342 hospital visit records were obtained from Suratthani tertiary hospital in southern Thailand. The records contained data from 2016 on chronic disease patients in Diagnosis-Related Groups (DRGs). The prediction performance of linear, penalized linear and machine learning models for hospital visit costs was compared using both the original dataset and datasets expanded two- and four-fold by bootstrapping. The mean age of patients was 56.3 ± 22.6 years, and 55.6% of visits were by males. The median hospital cost was 16,662 Baht per visit. The random forest (RF) model had the best predictive performance for hospital visit costs at all dataset sizes, with the smallest prediction errors, whereas ridge linear regression had the poorest, with the largest prediction errors. Machine learning models predicted better with enlarged sample sizes, whereas linear and penalized linear models did not. For prediction modeling on big data, machine learning models are preferable, whereas the predictions of linear and penalized linear models are not affected by increasing the sample size.

• The random forest model has the best predictive performance for hospital visit costs.
• Predictions of linear regression and penalized linear regression models are not affected by increasing sample size.
• Machine learning models are appropriate for prediction with large sample sizes.
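
The summary mentions expanding the dataset two- and four-fold by bootstrap and comparing prediction errors. The following is only a minimal sketch of that idea, not the authors' procedure: it assumes plain resampling with replacement, uses RMSE as the error measure (the summary does not name one), and runs on synthetic log-normal "costs" with arbitrarily chosen parameters rather than the study's data. The helper names `bootstrap_expand` and `rmse` are hypothetical.

```python
import random
import statistics


def bootstrap_expand(records, factor, seed=0):
    """Resample with replacement to get `factor` times as many records.

    Assumption: simple row-level bootstrap; the study's exact scheme is not
    described in the summary.
    """
    rng = random.Random(seed)
    return rng.choices(records, k=factor * len(records))


def rmse(actual, predicted):
    """Root mean squared error (an assumed error measure)."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return (sum(squared) / len(squared)) ** 0.5


# Synthetic, positively skewed costs. These are NOT the study's data;
# the log-normal parameters are arbitrary and purely illustrative.
rng = random.Random(42)
costs = [rng.lognormvariate(9.7, 0.8) for _ in range(1000)]

# Positive skew shows up as a mean that sits above the median.
print("mean:", round(statistics.mean(costs)),
      "median:", round(statistics.median(costs)))

for factor in (1, 2, 4):
    sample = bootstrap_expand(costs, factor, seed=factor)
    # Mean-only baseline predictor, standing in for a fitted model.
    baseline = statistics.mean(sample)
    error = rmse(sample, [baseline] * len(sample))
    print(f"x{factor}: n={len(sample)}, baseline RMSE={error:.1f}")
```

In a fuller comparison, the mean-only baseline would be replaced by the fitted models (for example ridge regression and a random forest), each trained and evaluated on the original and expanded datasets.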
ISSN:2352-9148
DOI:10.1016/j.imu.2021.100769