Prediction of lung cancer patient survival via supervised machine learning classification techniques

Bibliographic Details
Published in International journal of medical informatics (Shannon, Ireland) Vol. 108; pp. 1-8
Main Authors Lynch, Chip M., Abdollahi, Behnaz, Fuqua, Joshua D., de Carlo, Alexandra R., Bartholomai, James A., Balgemann, Rayeanne N., van Berkel, Victor H., Frieboes, Hermann B.
Format Journal Article
Language English
Published Ireland Elsevier B.V 01.12.2017
Summary:
• Compared supervised machine learning algorithms to determine predictive correlation.
• The models perform well with low to moderate lung cancer patient survival times.
• Created a custom ensemble, enabling evaluation of each model's predictive power.
• Results of the models are consistent with a classical Cox proportional hazards model.

Outcomes for cancer patients have previously been estimated by applying various machine learning techniques to large datasets such as the Surveillance, Epidemiology, and End Results (SEER) program database. For lung cancer in particular, it is not well understood which types of techniques yield more predictive information, or which data attributes should be used to obtain it.

In this study, a number of supervised learning techniques are applied to the SEER database to classify lung cancer patients in terms of survival, including linear regression, Decision Trees, Gradient Boosting Machines (GBM), Support Vector Machines (SVM), and a custom ensemble. Key data attributes in applying these methods include tumor grade, tumor size, gender, age, stage, and number of primaries, with the goal of enabling comparison of predictive power between the various methods. The prediction is treated as a continuous target, rather than a classification into categories, as a first step towards improving survival prediction.

The results show that the predicted values agree with actual values for low to moderate survival times, which constitute the majority of the data. The best performing technique was the custom ensemble, with a Root Mean Square Error (RMSE) value of 15.05. The most influential model within the custom ensemble was GBM, while Decision Trees may be inapplicable because they produced too few discrete outputs. The results further show that among the five individual models generated, the most accurate was GBM, with an RMSE value of 15.32. Although SVM underperformed with an RMSE value of 15.82, statistical analysis singles out SVM as the only model that generated a distinctive output. The results of the models are consistent with a classical Cox proportional hazards model used as a reference technique.

We conclude that application of these supervised learning techniques to lung cancer data in the SEER database may be of use in estimating patient survival time, with the ultimate goal of informing patient care decisions, and that the performance of these techniques with this particular dataset may be on par with that of classical methods.
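The RMSE comparison described in the summary can be illustrated with a minimal Python sketch. The record does not say how the custom ensemble combines its member models, so the weighted average used here is an assumption, as are the function names `rmse` and `weighted_ensemble`, the weights, and the sample values (which are illustrative only and are not SEER data).

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def weighted_ensemble(predictions, weights):
    """Combine per-model predictions with a normalized weighted average.

    Assumption: a weighted average stands in for the study's custom
    ensemble, whose actual construction is not given in this record.
    """
    total = sum(weights[name] for name in predictions)
    n = len(next(iter(predictions.values())))
    return [
        sum(weights[name] * predictions[name][i] for name in predictions) / total
        for i in range(n)
    ]


# Hypothetical survival times (illustrative values only, not SEER data).
actual = [6.0, 12.0, 20.0, 35.0]
predictions = {
    "gbm": [8.0, 11.0, 18.0, 30.0],
    "svm": [5.0, 15.0, 25.0, 28.0],
}
weights = {"gbm": 0.6, "svm": 0.4}  # assumed weights, chosen for illustration

combined = weighted_ensemble(predictions, weights)
for name, preds in predictions.items():
    print(name, round(rmse(actual, preds), 2))
print("ensemble", round(rmse(actual, combined), 2))
```

A lower RMSE means the predicted survival times lie closer to the observed ones on average. Because errors are squared before averaging, large misses on long survival times weigh heavily on the score.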
ISSN: 1386-5056, 1872-8243
DOI: 10.1016/j.ijmedinf.2017.09.013