Comparison of Machine Learning and Deep Learning Methods for the Prediction of Osteoradionecrosis Resulting from Head and Neck Cancer Radiation Therapy

Bibliographic Details
Published in: International Journal of Radiation Oncology, Biology, Physics, Vol. 114, No. 3, p. e124
Main Authors: Reber, B., van Dijk, L.V., Anderson, B.M., Mohamed, A.S., Rigaud, B., He, Y., Woodland, M., Fuller, C.D., Lai, S.Y., Brock, K.K.
Format: Journal Article
Language: English
Published: Elsevier Inc., 01.11.2022
Summary: To test the hypothesis that deep learning (DL) techniques, using full dose distributions, can outperform machine learning (ML) methods, using dose summary statistics, in the prediction of osteoradionecrosis (ORN) resulting from head and neck cancer (HNC) radiotherapy (RT).

1259 subjects from a single institution who received HNC RT with curative intent were identified. All 1259 subjects were included in the ML study, and 1236 subjects with available dose maps and mandible contours were included in the DL study. After two years of follow-up, 173 patients had developed ORN of any grade and 1086 remained ORN-free (171 ORN+/1064 ORN- in the DL cohort).

The ML methods, including logistic regression (LR), random forest (RF), support vector machine (SVM), principal component regression (PCR), and XGBoost, predicted ORN status from subject dose summary statistics. The DL methods, including ResNet, DenseNet, a DenseNet+ResNet ensemble, and autoencoder architectures, predicted ORN status from subject 3D dose maps cropped to a bounding box around the mandible contour; the autoencoder architecture fed its bottleneck features through convolutional layers for prediction. The impact of training-set size on DL performance was evaluated by retraining the architectures on decreasing fractions of the original training dataset (100% to 10% in 10% decrements).

Model prediction performance was quantified using recall, precision, balanced accuracy, and area under the precision-recall curve (AUPRC). The ML results are the average of 10-fold stratified cross-validation with 3 repeats, whereas the DL results are from a withheld test set (650/217/369 train/validation/test case split with 111/12/48 ORN+ cases per set, respectively). Class imbalance in the DL models was handled by randomly oversampling ORN+ cases in the training set to match the number of ORN- cases. The table shows the ML and DL ORN prediction results.
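The class-imbalance strategy and the balanced-accuracy metric described above can be sketched in plain Python. The helper names and toy data below are illustrative assumptions, not code from the study.

```python
import random

def oversample_minority(X, y, pos_label=1, seed=0):
    """Randomly duplicate minority (ORN+) cases until they match the ORN- count."""
    rng = random.Random(seed)
    pos = [(x, lab) for x, lab in zip(X, y) if lab == pos_label]
    neg = [(x, lab) for x, lab in zip(X, y) if lab != pos_label]
    extra = [rng.choice(pos) for _ in range(len(neg) - len(pos))]
    data = pos + neg + extra
    rng.shuffle(data)
    Xs, ys = zip(*data)
    return list(Xs), list(ys)

def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity; robust to class imbalance."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    n_pos = sum(t == 1 for t in y_true)
    n_neg = len(y_true) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)

# Toy example: 1 ORN+ case among 10 subjects.
X = [[d] for d in range(10)]
y = [1] + [0] * 9
Xb, yb = oversample_minority(X, y)  # balanced: 9 ORN+ and 9 ORN- samples
```

Equivalent off-the-shelf implementations exist, e.g. `sklearn.metrics.balanced_accuracy_score` and, for oversampling, `RandomOverSampler` from the imbalanced-learn package.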
Decreasing the amount of training data had no impact on DL performance; even in the extreme case of training the DL models on only 10% of the training data, balanced accuracy and F1 score did not decrease. The traditional ML models outperformed the DL models. The lack of improvement in DL performance with increasing amounts of available training data suggests that either significantly more data are needed for DL model construction and/or that low-level dose-image features are not informative for this task. The poor DL performance despite a relatively large training cohort suggests that additional imaging modalities should be explored in conjunction with 3D dose maps.
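The training-fraction ablation discussed above (retraining on 100% down to 10% of the training set in 10% decrements) can be sketched as a stratified subsampling loop. The `fit_dl_model` placeholder and toy arrays are assumptions for illustration, not the study's code.

```python
import random

def stratified_subsample(X, y, fraction, seed=0):
    """Keep `fraction` of each class so the ORN+/ORN- ratio is preserved."""
    rng = random.Random(seed)
    keep_X, keep_y = [], []
    for label in sorted(set(y)):
        idx = [i for i, lab in enumerate(y) if lab == label]
        n_keep = max(1, round(fraction * len(idx)))
        for i in rng.sample(idx, n_keep):
            keep_X.append(X[i])
            keep_y.append(y[i])
    return keep_X, keep_y

# Toy training set: 4 ORN+ among 20 subjects.
X_train = [[i] for i in range(20)]
y_train = [1] * 4 + [0] * 16

# Ablation: 100% down to 10% of the training data in 10% decrements.
for pct in range(100, 0, -10):
    Xf, yf = stratified_subsample(X_train, y_train, pct / 100, seed=pct)
    # model = fit_dl_model(Xf, yf)        # placeholder: retrain the architecture
    # score = evaluate(model, X_test, y_test)
```

Stratifying the subsample keeps the ORN+ prevalence constant across fractions, so any performance change reflects training-set size rather than a shifting class balance.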
ISSN: 0360-3016
DOI: 10.1016/j.ijrobp.2022.07.946