Identification of high-risk lesions through automated natural language processing (NLP) of pathology reports

Bibliographic Details
Published in Cancer research (Chicago, Ill.) Vol. 69; no. 2_Supplement; p. 3001
Main Authors Ozanne, EM, Sharko, J, Drohan, B, Grinstein, G, Hughes, KS
Format Journal Article
Language English
Published 15.01.2009

Summary: Abstract #3001
 Purpose
 Pathology reports contain extensive research information that is inaccessible except through costly and time-consuming chart reviews, because the reports are recorded as semi-structured prose whose critically important descriptive text is intended for human interpretation. Key challenges in processing these data include interpreting the multiple ways of describing the same finding and then aggregating the findings of multiple reports into episodes of care. Investigators tested NLP techniques for processing pathology reports into structured data and episodes of care, allowing for the rapid identification and epidemiologic modeling of high-risk breast lesions.
 Methods
 Using state-of-the-art NLP software (ClearForest, A Thomson Reuters Company, Waltham, MA), breast pathology reports stored as text files were processed into a structured electronic database in five steps: 1) identification of the diagnoses of interest (i.e., high-risk lesions, cancer), 2) use of NLP to identify all terms and phrases used to report each finding (e.g., atypical hyperplasia, hyperplasia with atypia), 3) grouping of relevant terms into categories, 4) identification of the categories occurring in each patient report, and 5) grouping of patient reports into episodes of care (defined as all reports within 6 months of an initial diagnosis).
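 Steps 2 through 5 can be illustrated with a minimal Python sketch. It is not the ClearForest workflow used in the study: the regular-expression term lists, the category labels, the 183-day approximation of 6 months, and the function names `categorize` and `group_into_episodes` are all assumptions made for illustration. The sketch also opens a new episode at the first report that falls outside the previous episode's window, and it ignores negation (e.g. "no evidence of atypia"), which a real system would have to handle.

```python
import re
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical synonym patterns per category (steps 2-3); not the study's term lists.
CATEGORY_PATTERNS = {
    "ADH": [r"atypical ductal hyperplasia", r"ductal hyperplasia with atypia"],
    "ALH": [r"atypical lobular hyperplasia", r"lobular hyperplasia with atypia"],
    "LCIS": [r"lobular carcinoma in situ"],
    "DCIS": [r"ductal carcinoma in situ"],
}

# Assumption: "6 months" is approximated as 183 days.
EPISODE_WINDOW = timedelta(days=183)


def categorize(report_text):
    """Step 4: return the set of categories whose patterns occur in one report."""
    text = report_text.lower()
    return {
        category
        for category, patterns in CATEGORY_PATTERNS.items()
        if any(re.search(pattern, text) for pattern in patterns)
    }


def group_into_episodes(reports):
    """Step 5: group (patient_id, report_date, text) tuples into episodes of care.

    A report joins the patient's current episode if it is dated within
    EPISODE_WINDOW of that episode's first report; otherwise it starts a new one.
    """
    by_patient = defaultdict(list)
    for patient_id, report_date, text in reports:
        by_patient[patient_id].append((report_date, categorize(text)))

    episodes = {}
    for patient_id, items in by_patient.items():
        items.sort(key=lambda item: item[0])  # chronological order
        patient_episodes = []
        for report_date, categories in items:
            current = patient_episodes[-1] if patient_episodes else None
            if current is not None and report_date - current["start"] <= EPISODE_WINDOW:
                current["categories"] |= categories
                current["end"] = report_date
            else:
                patient_episodes.append(
                    {"start": report_date, "end": report_date, "categories": set(categories)}
                )
        episodes[patient_id] = patient_episodes
    return episodes


example_reports = [
    ("p1", date(2000, 1, 1), "Atypical ductal hyperplasia."),
    ("p1", date(2000, 3, 1), "Lobular carcinoma in situ."),
    ("p1", date(2001, 1, 1), "Ductal carcinoma in situ."),
]
example_episodes = group_into_episodes(example_reports)
```

 In this toy input the first two reports fall within one window and merge into a single episode, while the third report, dated a year later, opens a second episode.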
 Results
 Under IRB approval, 27,931 breast pathology reports from 16,208 patients seen at Massachusetts General Hospital between 1990 and 2007 were analyzed. The results were compared against manually reviewed pathology reports for quality control. For DCIS diagnoses, the initial error rate for both the NLP process and the manual process was 2%. The NLP process was then re-tuned using the identified discrepancies, which reduced the error rate to zero. Using the refined model, we identified 1) patients with atypical lesions (atypical ductal hyperplasia (ADH), severe ADH, atypical lobular hyperplasia (ALH), and lobular carcinoma in situ (LCIS)) without prior or concurrent cancer, and 2) patients who developed cancer more than 6 months after diagnosis.
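 The two patient groups described above can be expressed as a simple rule over dated, categorized findings. The following Python sketch is illustrative only: the cancer labels, the 183-day approximation of 6 months, the reading of "concurrent" as "within 6 months of the first atypical diagnosis", and the function name `classify_patient` are assumptions, not details reported in the abstract.

```python
from datetime import date, timedelta

# Atypical lesion categories named in the abstract.
ATYPIA = {"ADH", "severe ADH", "ALH", "LCIS"}
# Assumed cancer labels; the abstract does not enumerate them.
CANCER = {"DCIS", "invasive cancer"}
# Assumption: "6 months" is approximated as 183 days.
WINDOW = timedelta(days=183)


def classify_patient(findings):
    """Classify one patient from a list of (date, category) findings.

    Returns:
      None                  - no atypia, or cancer prior to / concurrent with atypia
      "atypia_only"         - atypia with no cancer on record
      "atypia_then_cancer"  - atypia followed by cancer more than WINDOW later
    """
    atypia_dates = [d for d, category in findings if category in ATYPIA]
    if not atypia_dates:
        return None
    first_atypia = min(atypia_dates)

    cancer_dates = [d for d, category in findings if category in CANCER]
    # Any cancer dated up to the end of the window counts as prior or concurrent.
    if any(d <= first_atypia + WINDOW for d in cancer_dates):
        return None
    return "atypia_then_cancer" if cancer_dates else "atypia_only"
```

 Both non-`None` labels correspond to group 1 (atypia without prior or concurrent cancer); `"atypia_then_cancer"` marks the subset in group 2.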
 
 Conclusion
 This process successfully identified high-risk diagnoses that were otherwise relatively inaccessible, and it appears to match the accuracy of a human research associate. The results of this first implementation are promising and will be further validated over time. In the future, this approach can be applied to other medical reports and diseases. NLP has significant potential to decrease the cost of research and to improve patient care.
 Citation Information: Cancer Res 2009;69(2 Suppl):Abstract nr 3001.
ISSN:0008-5472
1538-7445
DOI:10.1158/0008-5472.SABCS-3001