Automatic information extraction from childhood cancer pathology reports
Published in | JAMIA open Vol. 5; no. 2; p. ooac049 |
---|---|
Format | Journal Article |
Language | English |
Published | United States, Oxford University Press, 01.07.2022 |
Summary: | Objectives
The International Classification of Childhood Cancer (ICCC) facilitates the effective classification of a heterogeneous group of cancers in the important pediatric population. However, there has been no development of machine learning models for the ICCC classification. We developed deep learning-based information extraction models from cancer pathology reports based on the ICD-O-3 coding standard. In this article, we describe extending the models to perform ICCC classification.
Materials and Methods
We developed 2 models, ICD-O-3 classification and ICCC recoding (Model 1) and direct ICCC classification (Model 2), and 4 scenarios subject to the training sample size. We evaluated these models with a corpus consisting of 29 206 reports with age at diagnosis between 0 and 19 from 6 state cancer registries.
Results
Our findings suggest that the direct ICCC classification (Model 2) is substantially better than reusing the ICD-O-3 classification model (Model 1). Applying the uncertainty quantification mechanism to assess the confidence of the algorithm in assigning a code demonstrated that the model achieved a micro-F1 score of 0.987 while abstaining (not sufficiently confident to assign a code) on only 14.8% of ambiguous pathology reports.
Conclusions
Our experimental results suggest that machine learning-based automatic extraction of ICCC codes from childhood cancer pathology reports is a reliable means of supplementing human annotators at state cancer registries, because the model reads and abstracts the majority of childhood cancer pathology reports accurately.
Lay Summary
ICCC is the coding standard designed to categorize childhood cancers. However, machine learning-based ICCC classification has not been extensively studied, mainly because the pediatric cancer corpus is small; pediatric cancer is much less prevalent than adult cancer. Under the oversight of the National Childhood Cancer Registry project, we developed a deep learning-based text comprehension model for classifying ICCC codes from childhood cancer pathology reports. We compared two approaches: (1) classifying ICD-O-3 codes and then recoding them into ICCC and (2) classifying ICCC codes directly. We observed that the second approach exhibited a substantially higher accuracy score. We are aware that low-precision models are not appropriate for this task because they would degrade the credibility of model-based decisions. We therefore applied an uncertainty quantification algorithm to the ICCC classification model. We achieved nearly perfect accuracy scores, while the model abstained on 14.8% of cases as ambiguous. This result means our machine learning model can serve human annotators at state cancer registries by processing 85.2% of the childhood cancer pathology reports automatically. |
---|---|
Bibliography: | USDOE AC05-00OR22725 |
ISSN: | 2574-2531 |
DOI: | 10.1093/jamiaopen/ooac049 |
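
The summary reports a micro-F1 score on the reports the model chose to code, together with the share of reports on which it abstained. The record does not say which uncertainty quantification mechanism was used, so the following is only a minimal sketch under stated assumptions: it substitutes a simple maximum-probability threshold for the authors' mechanism, and the function names, the threshold values, and the toy labels and probabilities are all hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def predict_or_abstain(probs: Dict[str, float], threshold: float) -> Optional[str]:
    """Return the most probable label, or None (abstain) when its
    probability is below the threshold. This thresholding rule is an
    assumed stand-in, not the method described in the article."""
    label, p = max(probs.items(), key=lambda kv: kv[1])
    return label if p >= threshold else None


def micro_f1_and_abstention(
    y_true: Sequence[str], y_pred: Sequence[Optional[str]]
) -> Tuple[float, float]:
    """Micro-F1 over the non-abstained reports, plus the abstention rate."""
    kept = [(t, p) for t, p in zip(y_true, y_pred) if p is not None]
    abstention_rate = 1.0 - len(kept) / len(y_true)
    if not kept:
        return 0.0, abstention_rate
    tp = sum(1 for t, p in kept if t == p)
    # Single-label setting: every wrong prediction counts as one false
    # positive (predicted class) and one false negative (true class).
    fp = fn = len(kept) - tp
    return 2 * tp / (2 * tp + fp + fn), abstention_rate


# Hypothetical toy data: four reports with made-up class labels.
y_true: List[str] = ["Ia", "Ia", "IIIb", "IVa"]
probs: List[Dict[str, float]] = [
    {"Ia": 0.95, "IIIb": 0.05},
    {"Ia": 0.55, "IIIb": 0.45},
    {"IIIb": 0.90, "Ia": 0.10},
    {"IVa": 0.40, "Ia": 0.60},
]

for threshold in (0.5, 0.8):
    y_pred = [predict_or_abstain(p, threshold) for p in probs]
    f1, rate = micro_f1_and_abstention(y_true, y_pred)
    print(f"threshold={threshold}: micro-F1={f1:.3f}, abstained={rate:.1%}")
```

On this toy data, raising the threshold from 0.5 to 0.8 lifts micro-F1 on the retained reports from 0.750 to 1.000 while the abstention rate rises from 0% to 50%. This is the same kind of trade-off the summary describes, where reports the model passes over are left for human annotators.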