A data-driven architecture using natural language processing to improve phenotyping efficiency and accelerate genetic diagnoses of rare disorders
Published in | HGG advances Vol. 2; no. 3; p. 100035 |
---|---|
Main Authors | , , , , , , , , , , , , , , , , , , , , , |
Format | Journal Article |
Language | English |
Published | United States Elsevier Inc 08.07.2021 Elsevier |
ISSN | 2666-2477 2666-2477 |
DOI | 10.1016/j.xhgg.2021.100035 |
Summary: | Effective genetic diagnosis requires the correlation of genetic variant data with detailed phenotypic information. However, manual encoding of clinical data into machine-readable forms is laborious and subject to observer bias. Natural language processing (NLP) of electronic health records has great potential to enhance reproducibility at scale but suffers from idiosyncrasies in physician notes and other medical records. We developed methods to optimize NLP outputs for automated diagnosis. We filtered NLP-extracted Human Phenotype Ontology (HPO) terms to more closely resemble manually extracted terms and identified filter parameters across a three-dimensional space for optimal gene prioritization. We then developed a tiered pipeline that reduces manual effort by prioritizing smaller subsets of genes to consider for genetic diagnosis. Our filtering pipeline enabled NLP-based extraction of HPO terms to serve as a sufficient replacement for manual extraction in 92% of prospectively evaluated cases. In 75% of cases, the correct causal gene was ranked higher with our applied filters than without any filters. We describe a framework that can maximize the utility of NLP-based phenotype extraction for gene prioritization and diagnosis. The framework is implemented within a cloud-based modular architecture that can be deployed across health and research institutions.
Natural language processing (NLP) holds promise for automating the phenotyping process and generating machine-readable lists of codes that describe a patient’s condition. We describe a generalizable framework of sequential filtration steps that can be applied across any electronic health records system, NLP platform, and molecular diagnostic aid to improve the diagnostic utility of NLP-derived phenotypic code lists. |
---|---|
Bibliography: | These authors contributed equally to this work. Present address: Sema4, Stamford, CT 06902, USA. Present address: Latent Strategies, LLC, Newton, MA 02465, USA. |
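The summary describes filtering NLP-extracted HPO terms and searching a three-dimensional parameter space for the settings that best prioritize the causal gene. This record does not name the three filter dimensions, so the sketch below assumes three hypothetical ones (a minimum mention count, a minimum per-term specificity score, and a cap on the number of terms) and a toy overlap-based gene ranker. All names, fields, and identifiers are illustrative and are not taken from the paper.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class ExtractedTerm:
    hpo_id: str         # placeholder identifier for an HPO term
    mentions: int       # hypothetical: how often the NLP tool emitted the term
    specificity: float  # hypothetical: any per-term specificity score


def filter_terms(terms, min_mentions, min_specificity, max_terms):
    """Apply three assumed filters and return the surviving term IDs."""
    kept = [
        t for t in terms
        if t.mentions >= min_mentions and t.specificity >= min_specificity
    ]
    # Keep the most specific terms first; break ties by ID for determinism.
    kept.sort(key=lambda t: (-t.specificity, t.hpo_id))
    return [t.hpo_id for t in kept[:max_terms]]


def causal_gene_rank(term_ids, gene_to_terms, causal_gene):
    """Toy ranker: score each gene by term overlap; rank 1 is best."""
    query = set(term_ids)
    scores = {g: len(query & ts) for g, ts in gene_to_terms.items()}
    ordered = sorted(scores, key=lambda g: (-scores[g], g))
    return ordered.index(causal_gene) + 1


def grid_search(cases, gene_to_terms, grid):
    """Return (mean_rank, params) for the best point on a 3-axis grid.

    cases: list of (terms, causal_gene) pairs with a known diagnosis.
    grid:  three sequences of candidate values, one per filter parameter.
    """
    best = None
    for params in product(*grid):
        ranks = [
            causal_gene_rank(filter_terms(terms, *params), gene_to_terms, gene)
            for terms, gene in cases
        ]
        mean_rank = sum(ranks) / len(ranks)
        if best is None or mean_rank < best[0]:
            best = (mean_rank, params)
    return best
```

In practice the toy ranker would be replaced by whatever gene prioritization tool is in use, and the grid would be evaluated on cases whose causal gene is already known.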