Artificial intelligence in healthcare education: evaluating the accuracy of ChatGPT, Copilot, and Google Gemini in cardiovascular pharmacology

Bibliographic Details
Published in: Frontiers in Medicine, Vol. 12, p. 1495378
Main Authors: Salman, Ibrahim M., Ameer, Omar Z., Khanfar, Mohammad A., Hsieh, Yee-Hsee
Format: Journal Article
Language: English
Published: Switzerland: Frontiers Media S.A., 19.02.2025
Summary: Artificial intelligence (AI) is revolutionizing medical education; however, its limitations remain underexplored. This study evaluated the accuracy of three generative AI tools (ChatGPT-4, Copilot, and Google Gemini) in answering multiple-choice questions (MCQs) and short-answer questions (SAQs) related to cardiovascular pharmacology, a key subject in healthcare education. Using the free version of each AI tool, we administered 45 MCQs and 30 SAQs across three difficulty levels: easy, intermediate, and advanced. AI-generated answers were reviewed by three pharmacology experts. MCQ responses were recorded as correct or incorrect, while SAQ responses were rated on a 1-5 scale for relevance, completeness, and correctness. ChatGPT, Copilot, and Gemini all achieved high accuracy on easy and intermediate MCQs (87-100%). While every model declined on the advanced MCQ section, only Copilot (53% accuracy) and Gemini (20% accuracy) scored significantly lower than on the easy and intermediate levels. SAQ evaluations showed high accuracy for ChatGPT (overall 4.7 ± 0.3) and Copilot (overall 4.5 ± 0.4) across all difficulty levels, with no significant difference between the two tools. In contrast, Gemini's SAQ performance was markedly lower at every level (overall 3.3 ± 1.0). ChatGPT-4 demonstrates the highest accuracy on both MCQ and SAQ cardiovascular pharmacology questions, regardless of difficulty level. Copilot ranks second, while Google Gemini shows significant limitations in handling complex MCQs and in providing accurate SAQ responses in this field. These findings can guide the ongoing refinement of AI tools for specialized medical education.
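
To make the scoring scheme in the abstract concrete (binary marking for MCQs reported as percent accuracy per difficulty level, and SAQ ratings summarized as mean ± SD), the following minimal Python sketch reproduces those two computations. The question count per level (15, i.e., 45 MCQs over three levels) follows the study design, but all individual results and ratings below are invented for illustration; this is not data from the article.

    from statistics import mean, stdev

    # Hypothetical MCQ results for one AI tool (True = correct answer).
    # 15 questions per difficulty level, matching 45 MCQs over three levels.
    mcq_results = {
        "easy":         [True] * 14 + [False],
        "intermediate": [True] * 13 + [False] * 2,
        "advanced":     [True] * 8 + [False] * 7,
    }

    # Hypothetical SAQ ratings on the 1-5 scale described in the abstract
    # (relevance, completeness, correctness), averaged over expert reviewers.
    saq_ratings = [4.7, 4.3, 5.0, 4.5, 4.8, 4.2]

    # Percent accuracy per difficulty level, as reported for the MCQ section.
    for level, answers in mcq_results.items():
        accuracy = 100 * sum(answers) / len(answers)
        print(f"{level}: {accuracy:.0f}% correct")

    # Overall SAQ score as mean ± standard deviation, as reported per tool.
    print(f"SAQ overall: {mean(saq_ratings):.1f} ± {stdev(saq_ratings):.1f}")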
Edited by: Jacqueline G. Bloomfield, The University of Sydney, Australia
Reviewed by: Nataly Martini, The University of Auckland, New Zealand
Rebecca S. Koszalinski, University of Central Florida, United States
ISSN: 2296-858X
DOI: 10.3389/fmed.2025.1495378