Automated real-world data integration improves cancer outcome prediction

Bibliographic Details
Published in Nature (London) Vol. 636; no. 8043; pp. 728-736
Main Authors Jee, Justin; Fong, Christopher; Pichotta, Karl; Tran, Thinh Ngoc; Luthra, Anisha; Waters, Michele; Fu, Chenlian; Altoe, Mirella; Liu, Si-Yang; Maron, Steven B.; Ahmed, Mehnaj; Kim, Susie; Pirun, Mono; Chatila, Walid K.; de Bruijn, Ino; Pasha, Arfath; Kundra, Ritika; Gross, Benjamin; Mastrogiacomo, Brooke; Aprati, Tyler J.; Liu, David; Gao, JianJiong; Capelletti, Marzia; Pekala, Kelly; Loudon, Lisa; Perry, Maria; Bandlamudi, Chaitanya; Donoghue, Mark; Satravada, Baby Anusha; Martin, Axel; Shen, Ronglai; Chen, Yuan; Brannon, A. Rose; Chang, Jason; Braunstein, Lior; Li, Anyi; Safonov, Anton; Stonestrom, Aaron; Sanchez-Vela, Pablo; Wilhelm, Clare; Robson, Mark; Scher, Howard; Ladanyi, Marc; Reis-Filho, Jorge S.; Solit, David B.; Jones, David R.; Gomez, Daniel; Yu, Helena; Chakravarty, Debyani; Yaeger, Rona; Abida, Wassim; Park, Wungki; O’Reilly, Eileen M.; Garcia-Aguilar, Julio; Socci, Nicholas; Sanchez-Vega, Francisco; Carrot-Zhang, Jian; Stetson, Peter D.; Levine, Ross; Rudin, Charles M.; Berger, Michael F.; Shah, Sohrab P.; Schrag, Deborah; Razavi, Pedram; Kehl, Kenneth L.; Li, Bob T.; Riely, Gregory J.; Schultz, Nikolaus
Format Journal Article
Language English
Published London Nature Publishing Group UK 19.12.2024
Nature Publishing Group
Summary:The digitization of health records and growing availability of tumour DNA sequencing provide an opportunity to study the determinants of cancer outcomes with unprecedented richness. Patient data are often stored in unstructured text and siloed datasets. Here we combine natural language processing annotations (refs. 1, 2) with structured medication, patient-reported demographic, tumour registry and tumour genomic data from 24,950 patients at Memorial Sloan Kettering Cancer Center to generate a clinicogenomic, harmonized oncologic real-world dataset (MSK-CHORD). MSK-CHORD includes data for non-small-cell lung (n = 7,809), breast (n = 5,368), colorectal (n = 5,543), prostate (n = 3,211) and pancreatic (n = 3,109) cancers and enables discovery of clinicogenomic relationships not apparent in smaller datasets. Leveraging MSK-CHORD to train machine learning models to predict overall survival, we find that models including features derived from natural language processing, such as sites of disease, outperform those based on genomic data or stage alone as tested by cross-validation and an external, multi-institution dataset. By annotating 705,241 radiology reports, MSK-CHORD also uncovers predictors of metastasis to specific organ sites, including a relationship between SETD2 mutation and lower metastatic potential in immunotherapy-treated lung adenocarcinoma corroborated in independent datasets. We demonstrate the feasibility of automated annotation from unstructured notes and its utility in predicting patient outcomes. The resulting data are provided as a public resource for real-world oncologic research.
A study generates a clinicogenomics dataset resource, MSK-CHORD, that combines natural language processing-derived clinical annotations with patient medical data from various sources to improve models of cancer outcome.
ISSN:0028-0836
1476-4687
DOI:10.1038/s41586-024-08167-5