Adapted large language models can outperform medical experts in clinical text summarization

Bibliographic Details
Published in: Nature Medicine, Vol. 30, No. 4, pp. 1134–1142
Main authors: Van Veen, Dave; Van Uden, Cara; Blankemeier, Louis; Delbrouck, Jean-Benoit; Aali, Asad; Bluethgen, Christian; Pareek, Anuj; Polacin, Malgorzata; Reis, Eduardo Pontes; Seehofnerová, Anna; Rohatgi, Nidhi; Hosamani, Poonam; Collins, William; Ahuja, Neera; Langlotz, Curtis P.; Hom, Jason; Gatidis, Sergios; Pauly, John; Chaudhari, Akshay S.
Format: Journal Article
Language: English
Published: New York: Nature Publishing Group US, 01.04.2024

Summary: Analyzing vast textual data and summarizing key information from electronic health records imposes a substantial burden on how clinicians allocate their time. Although large language models (LLMs) have shown promise in natural language processing (NLP) tasks, their effectiveness on a diverse range of clinical summarization tasks remains unproven. Here we applied adaptation methods to eight LLMs, spanning four distinct clinical summarization tasks: radiology reports, patient questions, progress notes and doctor–patient dialogue. Quantitative assessments with syntactic, semantic and conceptual NLP metrics reveal trade-offs between models and adaptation methods. A clinical reader study with 10 physicians evaluated summary completeness, correctness and conciseness; in most cases, summaries from our best-adapted LLMs were deemed either equivalent (45%) or superior (36%) compared with summaries from medical experts. The ensuing safety analysis highlights challenges faced by both LLMs and medical experts, as we connect errors to potential medical harm and categorize types of fabricated information. Our research provides evidence of LLMs outperforming medical experts in clinical text summarization across multiple tasks. This suggests that integrating LLMs into clinical workflows could alleviate documentation burden, allowing clinicians to focus more on patient care.

Comparative performance assessment of large language models identified ChatGPT-4 as the best-adapted model across a diverse set of clinical text summarization tasks, and it outperformed 10 medical experts in a reader study.
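For illustration only, here is a minimal sketch of how a model-generated summary might be scored against a medical-expert reference using a syntactic metric (ROUGE-L) and a semantic metric (BERTScore), two of the metric families the abstract mentions. This is not the paper's evaluation code; the library choices (`rouge-score`, `bert-score`) and the example texts are assumptions made for this sketch.

```python
# Illustrative sketch (not the paper's code): compare a hypothetical LLM summary
# against a hypothetical medical-expert reference. Requires the `rouge-score`
# and `bert-score` packages.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "No acute cardiopulmonary abnormality."               # expert summary (made up)
candidate = "No acute abnormality seen on the chest radiograph."  # LLM summary (made up)

# Syntactic overlap: ROUGE-L scores the longest common subsequence of tokens.
rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l_f1 = rouge.score(reference, candidate)["rougeL"].fmeasure

# Semantic similarity: BERTScore compares contextual token embeddings.
_, _, f1 = bert_score([candidate], [reference], lang="en")

print(f"ROUGE-L F1:   {rouge_l_f1:.3f}")
print(f"BERTScore F1: {f1.item():.3f}")
```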
Author contributions
D.V.V. collected data, developed code, ran experiments, designed reader studies, analyzed results, created figures and wrote the manuscript. All authors reviewed the manuscript and provided meaningful revisions and feedback. C.V.U., L.B. and J.B.D. provided technical advice, in addition to conducting qualitative analysis (C.V.U.), building infrastructure for the Azure API (L.B.) and implementing the MEDCON metric (J.B.D.). A.A. assisted in model fine-tuning. C.B., A.P., M.P., E.P.R. and A.S. participated in the reader study as radiologists. N.R., P.H., W.C., N.A. and J.H. participated in the reader study as hospitalists. C.P.L., J.P. and A.S.C. provided student funding. S.G. advised on study design, for which J.H. and J.P. provided additional feedback. J.P. and A.S.C. guided the project, with A.S.C. serving as principal investigator and advising on technical details and overall direction. No funders or third parties were involved in study design, analysis or writing.
ISSN: 1078-8956
EISSN: 1546-170X
DOI: 10.1038/s41591-024-02855-5