Assessment of Large Language Models (LLMs) in decision-making support for gynecologic oncology


Bibliographic Details
Published in	Computational and Structural Biotechnology Journal, Vol. 23, pp. 4019-4026
Main Authors	Gumilar, Khanisyah Erza; Indraprasta, Birama R.; Faridzi, Ach Salman; Wibowo, Bagus M.; Herlambang, Aditya; Rahestyningtyas, Eccita; Irawan, Budi; Tambunan, Zulkarnain; Bustomi, Ahmad Fadhli; Brahmantara, Bagus Ngurah; Yu, Zih-Ying; Hsu, Yu-Cheng; Pramuditya, Herlangga; Putra, Very Great E.; Nugroho, Hari; Mulawardhana, Pungky; Tjokroprawiro, Brahmana A.; Hedianto, Tri; Ibrahim, Ibrahim H.; Huang, Jingshan; Li, Dongqi; Lu, Chien-Hsing; Yang, Jer-Yen; Liao, Li-Na; Tan, Ming
Format	Journal Article
Language	English
Published	Netherlands: Elsevier B.V.; Research Network of Computational and Structural Biotechnology, 01.12.2024
Subjects	Accuracy; Artificial intelligence; Biotechnology; Consistency; Decision making; Gynecologic cancer; Large Language Models; Patients

Summary:	This study investigated whether Large Language Models (LLMs) can provide accurate and consistent answers in complex gynecologic cancer cases. LLMs are advancing rapidly and require thorough evaluation before they can be used safely and effectively in clinical decision-making; such evaluation is essential for confirming their reliability and accuracy in supporting medical professionals in casework. We assessed three prominent LLMs, ChatGPT-4 (CG-4), Gemini Advanced (GemAdv), and Copilot, for accuracy, consistency, and overall performance, using fifteen clinical vignettes of varying difficulty and five open-ended questions based on real patient cases. The responses were coded, randomized, and evaluated blindly by six expert gynecologic oncologists using a 5-point Likert scale for relevance, clarity, depth, focus, and coherence. GemAdv demonstrated superior accuracy (81.87 %) compared to both CG-4 (61.60 %) and Copilot (70.67 %) across all difficulty levels. GemAdv also provided correct answers more frequently, exceeding 60 % on every day of the testing period. Although CG-4 showed a slight advantage in adhering to the National Comprehensive Cancer Network (NCCN) treatment guidelines, GemAdv excelled in the depth and focus of its answers, which are crucial aspects of clinical decision-making. LLMs, especially GemAdv, show potential in supporting clinical practice by providing accurate, consistent, and relevant information on gynecologic cancer. However, further refinement is needed for more complex scenarios. This study highlights the promise of LLMs in gynecologic oncology and the need for ongoing development and rigorous evaluation to maximize their clinical utility and reliability.
• Large Language Models (LLMs) are valuable tools in clinical practice, aiding healthcare professionals in making evidence-based decisions and improving patient care.
• Gemini Advanced achieved 81.87 % accuracy in clinical decision-making.
• Gemini Advanced consistently provided correct answers >60 % every day during the testing period.
• ChatGPT-4 and Gemini Advanced outperformed Copilot in treatment recommendations.
• Further improvements are necessary to ensure accurate and relevant responses across clinical scenarios.
ISSN:	2001-0370
DOI:	10.1016/j.csbj.2024.10.050