Adapted large language models can outperform medical experts in clinical text summarization
Published in | Nature Medicine Vol. 30; no. 4; pp. 1134–1142 |
---|---|
Format | Journal Article |
Language | English |
Published | New York; Nature Publishing Group US; 01.04.2024; Nature Publishing Group |
Summary: | Analyzing vast textual data and summarizing key information from electronic health records imposes a substantial burden on how clinicians allocate their time. Although large language models (LLMs) have shown promise in natural language processing (NLP) tasks, their effectiveness on a diverse range of clinical summarization tasks remains unproven. Here we applied adaptation methods to eight LLMs, spanning four distinct clinical summarization tasks: radiology reports, patient questions, progress notes and doctor–patient dialogue. Quantitative assessments with syntactic, semantic and conceptual NLP metrics reveal trade-offs between models and adaptation methods. A clinical reader study with 10 physicians evaluated summary completeness, correctness and conciseness; in most cases, summaries from our best-adapted LLMs were deemed either equivalent (45%) or superior (36%) compared with summaries from medical experts. The ensuing safety analysis highlights challenges faced by both LLMs and medical experts, as we connect errors to potential medical harm and categorize types of fabricated information. Our research provides evidence of LLMs outperforming medical experts in clinical text summarization across multiple tasks. This suggests that integrating LLMs into clinical workflows could alleviate documentation burden, allowing clinicians to focus more on patient care.
Comparative performance assessment of large language models identified ChatGPT-4 as the best-adapted model across a diverse set of clinical text summarization tasks, and it outperformed 10 medical experts in a reader study. |
---|---|
Bibliography: | Author contributions: D.V.V. collected data, developed code, ran experiments, designed reader studies, analyzed results, created figures and wrote the manuscript. All authors reviewed the manuscript and provided meaningful revisions and feedback. C.V.U., L.B. and J.B.D. provided technical advice, in addition to conducting qualitative analysis (C.V.U.), building infrastructure for the Azure API (L.B.) and implementing the MEDCON metric (J.B.). A.A. assisted in model fine-tuning. C.B., A.P., M.P., E.P.R. and A.S. participated in the reader study as radiologists. N.R., P.H., W.C., N.A. and J.H. participated in the reader study as hospitalists. C.P.L., J.P. and A.S.C. provided student funding. S.G. advised on study design, for which J.H. and J.P. provided additional feedback. J.P. and A.S.C. guided the project, with A.S.C. serving as principal investigator and advising on technical details and overall direction. No funders or third parties were involved in study design, analysis or writing. |
ISSN: | 1078-8956, 1546-170X |
DOI: | 10.1038/s41591-024-02855-5 |