A data-driven architecture using natural language processing to improve phenotyping efficiency and accelerate genetic diagnoses of rare disorders

Bibliographic Details
Published in	HGG advances Vol. 2; no. 3; p. 100035
Main Authors	Parikh, Jignesh R., Genetti, Casie A., Aykanat, Asli, Brownstein, Catherine A., Schmitz-Abe, Klaus, Danowski, Morgan, Quitadomo, Andrew, Madden, Jill A., Yacoubian, Calum, Gain, Richard, Williams, Tessa, Meskell, Mary, Brown, Andrew, Frith, Alison, Rockowitz, Shira, Sliz, Piotr, Agrawal, Pankaj B., Defay, Thomas, McDonagh, Paul, Reynders, John, Lefebvre, Sebastien, Beggs, Alan H.
Format	Journal Article
Language	English
Published	United States: Elsevier Inc, 08.07.2021
Subjects	electronic health records; genetics; Human Phenotype Ontology; natural language processing
ISSN	2666-2477
DOI	10.1016/j.xhgg.2021.100035

Summary:	Effective genetic diagnosis requires the correlation of genetic variant data with detailed phenotypic information. However, manual encoding of clinical data into machine-readable forms is laborious and subject to observer bias. Natural language processing (NLP) of electronic health records has great potential to enhance reproducibility at scale but suffers from idiosyncrasies in physician notes and other medical records. We developed methods to optimize NLP outputs for automated diagnosis. We filtered NLP-extracted Human Phenotype Ontology (HPO) terms to more closely resemble manually extracted terms and identified filter parameters across a three-dimensional space for optimal gene prioritization. We then developed a tiered pipeline that reduces manual effort by prioritizing smaller subsets of genes to consider for genetic diagnosis. Our filtering pipeline enabled NLP-based extraction of HPO terms to serve as a sufficient replacement for manual extraction in 92% of prospectively evaluated cases. In 75% of cases, the correct causal gene was ranked higher with our applied filters than without any filters. We describe a framework that can maximize the utility of NLP-based phenotype extraction for gene prioritization and diagnosis. The framework is implemented within a cloud-based modular architecture that can be deployed across health and research institutions. Natural language processing (NLP) holds promise for automating the phenotyping process and generating machine-readable lists of codes that describe a patient’s condition. We describe a generalizable framework of sequential filtration steps that can be applied across any electronic health records system, NLP platform, and molecular diagnostic aid to improve the diagnostic utility of NLP-derived phenotypic code lists.
Bibliography:	These authors contributed equally to this work. Present address: Sema4, Stamford, CT 06902, USA. Present address: Latent Strategies, LLC, Newton, MA 02465, USA.