Large Language Models for CAD-RADS 2.0 Extraction From Semi-Structured Coronary CT Angiography Reports: A Multi-Institutional Study
Published in | Korean Journal of Radiology, Vol. 26, No. 9, pp. 817–831
---|---
Main Authors | , , , , , , , , , ,
Format | Journal Article
Language | English
Published | The Korean Society of Radiology (대한영상의학회), 01.09.2025
Subjects |
Summary:

Objective: To evaluate the accuracy of large language models (LLMs) in extracting Coronary Artery Disease-Reporting and Data System (CAD-RADS) 2.0 components from coronary CT angiography (CCTA) reports, and to assess the impact of prompting strategies.
Materials and Methods: In this multi-institutional study, we collected 319 synthetic, semi-structured CCTA reports from six institutions to protect patient privacy while maintaining clinical relevance. The dataset included 150 reports from a primary institution (100 for instruction development and 50 for internal testing) and 169 reports from five external institutions for external testing. Board-certified radiologists established reference standards following the CAD-RADS 2.0 guidelines for all three components: stenosis severity, plaque burden, and modifiers. Six LLMs (GPT-4, GPT-4o, Claude-3.5-Sonnet, o1-mini, Gemini-1.5-Pro, and DeepSeek-R1-Distill-Qwen-14B) were evaluated using an optimized instruction with prompting strategies, including zero-shot or few-shot with or without chain-of-thought (CoT) prompting. The accuracy was assessed and compared using McNemar’s test.
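The Materials and Methods state that extraction accuracy was compared across models and prompting strategies using McNemar's test. As a minimal, hypothetical sketch (not the authors' code), the Python snippet below shows how paired per-report correctness for two models could be compared with the exact McNemar's test; the correctness vectors and model labels are invented for illustration.

```python
# Hypothetical sketch of the McNemar comparison described in the Methods.
# The correctness vectors below are invented for illustration; in the study,
# "correct" would mean the LLM's extracted CAD-RADS 2.0 component matches the
# radiologist-established reference standard for the same report.
from statsmodels.stats.contingency_tables import mcnemar

model_a_correct = [True, True, False, True, True, False, True, True, True, True]
model_b_correct = [True, False, False, True, True, True, True, True, False, True]

# Paired 2x2 contingency table:
#                 B correct   B wrong
#   A correct   [   n11     ,   n10  ]
#   A wrong     [   n01     ,   n00  ]
n11 = sum(a and b for a, b in zip(model_a_correct, model_b_correct))
n10 = sum(a and not b for a, b in zip(model_a_correct, model_b_correct))
n01 = sum((not a) and b for a, b in zip(model_a_correct, model_b_correct))
n00 = sum((not a) and (not b) for a, b in zip(model_a_correct, model_b_correct))

# The exact test evaluates only the discordant pairs (n10 vs. n01), which is
# appropriate when the number of discordant reports is small.
result = mcnemar([[n11, n10], [n01, n00]], exact=True)
print(f"discordant pairs: {n10} vs {n01}, p-value = {result.pvalue:.3f}")
```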
Results: LLMs demonstrated robust accuracy across all CAD-RADS 2.0 components. Peak stenosis severity accuracies reached 0.980 (48/49, Claude-3.5-Sonnet and o1-mini) in internal testing and 0.946 (158/167, GPT-4o and o1-mini) in external testing. Plaque burden extraction showed exceptional accuracy, with multiple models achieving perfect accuracy (43/43) in internal testing and 0.993 (137/138, GPT-4o and o1-mini) in external testing. Modifier detection demonstrated consistently high accuracy (≥0.990) across most models. One open-source model, DeepSeek-R1-Distill-Qwen-14B, showed a relatively low accuracy for stenosis severity: 0.898 (44/49, internal) and 0.820 (137/167, external). CoT prompting significantly enhanced the accuracy of several models, with GPT-4 showing the most substantial improvements: stenosis severity accuracy increased by 0.192 (P < 0.001) and plaque burden accuracy by 0.152 (P < 0.001) in external testing.
Conclusion: LLMs demonstrated high accuracy in automated extraction of CAD-RADS 2.0 components from semi-structured CCTA reports, particularly when used with CoT prompting.

KCI Citation Count: 0
Bibliography: https://doi.org/10.3348/kjr.2025.0293
ISSN: 1229-6929; 2005-8330
DOI: 10.3348/kjr.2025.0293