A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models

Bibliographic Details
Published in	Journal of Clinical Epidemiology, Vol. 110, pp. 12–22
Main Authors	Christodoulou, Evangelia; Ma, Jie; Collins, Gary S.; Steyerberg, Ewout W.; Verbakel, Jan Y.; Van Calster, Ben
Format	Journal Article
Language	English
Published	United States, Elsevier Inc, 01.06.2019; Elsevier Limited
Subjects	Algorithms; Area Under Curve; Artificial intelligence; Artificial neural networks; AUC; Bias; Calibration; Classification; Clinical prediction models; Confidence intervals; Discriminant analysis; Epidemiology; Humans; Learning algorithms; Logistic Models; Logistic regression; Machine learning; Mathematical models; Medical research; Modelling; Models, Theoretical; Neural networks; Outcome Assessment, Health Care; Prediction models; Predictive Value of Tests; Regression analysis; Reporting; Sensitivity and Specificity; Statistical analysis; Supervised Machine Learning; Support vector machines; Systematic review
Summary:	The objective of this study was to compare performance of logistic regression (LR) with machine learning (ML) for clinical prediction modeling in the literature. We conducted a Medline literature search (1/2016 to 8/2017) and extracted comparisons between LR and ML models for binary outcomes. We included 71 of 927 studies. The median sample size was 1,250 (range 72–3,994,872), with 19 predictors considered (range 5–563) and eight events per predictor (range 0.3–6,697). The most common ML methods were classification trees, random forests, artificial neural networks, and support vector machines. In 48 (68%) studies, we observed potential bias in the validation procedures. Sixty-four (90%) studies used the area under the receiver operating characteristic curve (AUC) to assess discrimination. Calibration was not addressed in 56 (79%) studies. We identified 282 comparisons between an LR and ML model (AUC range, 0.52–0.99). For 145 comparisons at low risk of bias, the difference in logit(AUC) between LR and ML was 0.00 (95% confidence interval, −0.18 to 0.18). For 137 comparisons at high risk of bias, logit(AUC) was 0.34 (0.20–0.47) higher for ML. We found no evidence of superior performance of ML over LR. Improvements in methodology and reporting are needed for studies that compare modeling algorithms.
ISSN:	0895-4356, 1878-5921
DOI:	10.1016/j.jclinepi.2019.02.004
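
Note on the logit(AUC) scale: the Summary reports differences between ML and LR as differences in logit(AUC), which are not directly readable as AUC points. The sketch below is one way to translate such a difference back to the AUC scale. It assumes the standard logit, log(p / (1 - p)). The baseline AUC values and the function names are illustrative assumptions and do not come from the study; only the 0.34 difference is taken from the Summary.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a value strictly between 0 and 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def inv_logit(x: float) -> float:
    """Inverse of logit: maps a log-odds value back to the (0, 1) scale."""
    return 1.0 / (1.0 + math.exp(-x))


def shift_auc(baseline_auc: float, logit_difference: float) -> float:
    """AUC implied by adding logit_difference to logit(baseline_auc)."""
    return inv_logit(logit(baseline_auc) + logit_difference)


if __name__ == "__main__":
    # Hypothetical baseline AUCs, chosen only to show how the same
    # logit difference maps to different AUC gains along the scale.
    for baseline in (0.60, 0.75, 0.90):
        # 0.34 is the pooled difference reported for comparisons
        # at high risk of bias.
        shifted = shift_auc(baseline, 0.34)
        print(f"baseline AUC {baseline:.2f} -> implied AUC {shifted:.3f}")
```

Because the logit stretches values near 1, the same logit difference corresponds to a smaller absolute AUC change for a model that already discriminates well than for one near the middle of the range.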