Automatic information extraction from childhood cancer pathology reports

Bibliographic Details
Published in	JAMIA open Vol. 5; no. 2; p. ooac049
Main Authors	Yoon, Hong-Jun; Peluso, Alina; Durbin, Eric B; Wu, Xiao-Cheng; Stroup, Antoinette; Doherty, Jennifer; Schwartz, Stephen; Wiggins, Charles; Coyle, Linda; Penberthy, Lynne
Format	Journal Article
Language	English
Published	United States: Oxford University Press, 01.07.2022
Subjects	60 APPLIED LIFE SCIENCES; Cancer in children; Cancer pathology reports; Computational linguistics; Evaluation; Information extraction; Language processing; Machine learning; Natural language interfaces; Pediatric cancer; Pediatrics; Research and Applications

Summary:	Objectives: The International Classification of Childhood Cancer (ICCC) facilitates the effective classification of a heterogeneous group of cancers in the important pediatric population. However, no machine learning models have been developed for ICCC classification. We developed deep learning-based models that extract information from cancer pathology reports according to the ICD-O-3 coding standard. In this article, we describe extending those models to perform ICCC classification.
Materials and Methods: We developed 2 models, ICD-O-3 classification followed by ICCC recoding (Model 1) and direct ICCC classification (Model 2), and evaluated them under 4 scenarios that differ in training sample size. We evaluated these models with a corpus of 29 206 reports, with age at diagnosis between 0 and 19, from 6 state cancer registries.
Results: Our findings suggest that direct ICCC classification (Model 2) is substantially better than reusing the ICD-O-3 classification model (Model 1). Applying an uncertainty quantification mechanism to assess the confidence of the algorithm in assigning a code demonstrated that the model achieved a micro-F1 score of 0.987 while abstaining (not sufficiently confident to assign a code) on only 14.8% of ambiguous pathology reports.
Conclusions: Our experimental results suggest that machine learning-based automatic extraction of ICCC information from childhood cancer pathology reports is a reliable means of supplementing human annotators at state cancer registries, because it reads and abstracts the majority of childhood cancer pathology reports accurately.
Lay Summary: ICCC is the coding standard designed to categorize childhood cancers. However, machine learning-based ICCC classification has not been extensively studied, mainly owing to the limited volume of the pediatric cancer corpus; pediatric cancer is much less prevalent than adult cancers. Under the oversight of the National Childhood Cancer Registry project, we developed a deep learning-based text comprehension model for classifying ICCC from childhood cancer pathology reports. We performed a comparison study between (1) classifying ICD-O-3 codes and then recoding into ICCC and (2) classifying ICCC codes directly. We observed that the second approach exhibited a substantially higher accuracy score. We are aware that low-precision models are not appropriate for this exercise because they will degrade the credibility of the model-based decisions. We therefore applied an uncertainty quantification algorithm to the ICCC classification model. We achieved nearly perfect accuracy scores, while the model passed over 14.8% of ambiguous cases. This result means our machine learning model can serve human annotators at state cancer registries by processing 85.2% of the childhood cancer pathology reports automatically.
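The trade-off reported in the summary, a micro-F1 score computed on the reports the model keeps versus the share of reports it abstains on, can be illustrated with a minimal Python sketch. This is not the article's uncertainty quantification algorithm, which the record does not specify; it assumes a simple rule that abstains whenever the top class probability falls below a threshold. The function names, the threshold value, and the toy ICCC group labels are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def predict_or_abstain(probs: Dict[str, float], threshold: float) -> Optional[str]:
    """Return the most probable label, or None (abstain) if it scores below threshold.

    Assumption: a plain confidence threshold stands in for the article's
    uncertainty quantification mechanism, which is not described in this record.
    """
    label, score = max(probs.items(), key=lambda kv: kv[1])
    return label if score >= threshold else None


def micro_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Micro-averaged F1 from pooled true positives, false positives, false negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != p)  # wrong label was predicted
    fn = fp  # in single-label data, each miss is also a false negative for the true label
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def evaluate_with_abstention(
    examples: List[Tuple[Dict[str, float], str]], threshold: float
) -> Tuple[float, float]:
    """Return (micro-F1 on retained reports, fraction of reports abstained on)."""
    kept_true: List[str] = []
    kept_pred: List[str] = []
    abstained = 0
    for probs, truth in examples:
        pred = predict_or_abstain(probs, threshold)
        if pred is None:
            abstained += 1
        else:
            kept_true.append(truth)
            kept_pred.append(pred)
    rate = abstained / len(examples) if examples else 0.0
    return micro_f1(kept_true, kept_pred), rate


# Hypothetical toy data: (class probabilities, true label) per report.
toy_examples = [
    ({"I": 0.95, "II": 0.05}, "I"),      # confident and correct
    ({"I": 0.55, "II": 0.45}, "II"),     # ambiguous: abstained at threshold 0.9
    ({"III": 0.92, "I": 0.08}, "III"),   # confident and correct
    ({"II": 0.97, "III": 0.03}, "I"),    # confident but wrong
]

if __name__ == "__main__":
    f1, rate = evaluate_with_abstention(toy_examples, threshold=0.9)
    print(f"micro-F1 on retained reports: {f1:.3f}; abstention rate: {rate:.1%}")
```

Raising the threshold generally trades coverage for accuracy on the retained reports. The figures in the summary follow the same bookkeeping: abstaining on 14.8% of reports leaves 100% - 14.8% = 85.2% to be processed automatically.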
Bibliography:	USDOE AC05-00OR22725
ISSN:	2574-2531
DOI:	10.1093/jamiaopen/ooac049