Artificial intelligence in healthcare education: evaluating the accuracy of ChatGPT, Copilot, and Google Gemini in cardiovascular pharmacology

Bibliographic Details
Published in: Frontiers in Medicine, Vol. 12, p. 1495378
Main Authors: Salman, Ibrahim M., Ameer, Omar Z., Khanfar, Mohammad A., Hsieh, Yee-Hsee
Format: Journal Article
Language: English
Published: Switzerland: Frontiers Media S.A., 19.02.2025
Summary: Artificial intelligence (AI) is revolutionizing medical education; however, its limitations remain underexplored. This study evaluated the accuracy of three generative AI tools (ChatGPT-4, Copilot, and Google Gemini) in answering multiple-choice questions (MCQs) and short-answer questions (SAQs) related to cardiovascular pharmacology, a key subject in healthcare education. Using the free version of each AI tool, we administered 45 MCQs and 30 SAQs across three difficulty levels: easy, intermediate, and advanced. AI-generated answers were reviewed by three pharmacology experts. MCQ responses were recorded as correct or incorrect, while SAQ responses were rated on a 1-5 scale for relevance, completeness, and correctness. ChatGPT, Copilot, and Gemini all achieved high accuracy on easy and intermediate MCQs (87-100%). While every model declined on the advanced MCQ section, only Copilot (53% accuracy) and Gemini (20% accuracy) scored significantly lower than on the easy and intermediate levels. SAQ evaluations showed high accuracy for ChatGPT (overall 4.7 ± 0.3) and Copilot (overall 4.5 ± 0.4) across all difficulty levels, with no significant difference between the two tools. In contrast, Gemini's SAQ performance was markedly lower at every level (overall 3.3 ± 1.0). ChatGPT-4 demonstrates the highest accuracy on both MCQ and SAQ cardiovascular pharmacology questions, regardless of difficulty level. Copilot ranks second, while Google Gemini shows significant limitations in handling complex MCQs and in providing accurate SAQ responses in this field. These findings can guide the ongoing refinement of AI tools for specialized medical education.
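
To make the scoring scheme in the abstract concrete (binary marking for MCQs reported as percent accuracy per difficulty level, and SAQ ratings summarized as mean ± SD), the following minimal Python sketch reproduces those two computations. The question count per level (15, i.e., 45 MCQs over three levels) follows the study design, but all individual results and ratings below are invented for illustration; this is not data from the article.

    from statistics import mean, stdev

    # Hypothetical MCQ results for one AI tool (True = correct answer).
    # 15 questions per difficulty level, matching 45 MCQs over three levels.
    mcq_results = {
        "easy":         [True] * 14 + [False],
        "intermediate": [True] * 13 + [False] * 2,
        "advanced":     [True] * 8 + [False] * 7,
    }

    # Hypothetical SAQ ratings on the 1-5 scale described in the abstract
    # (relevance, completeness, correctness), averaged over expert reviewers.
    saq_ratings = [4.7, 4.3, 5.0, 4.5, 4.8, 4.2]

    # Percent accuracy per difficulty level, as reported for the MCQ section.
    for level, answers in mcq_results.items():
        accuracy = 100 * sum(answers) / len(answers)
        print(f"{level}: {accuracy:.0f}% correct")

    # Overall SAQ score as mean ± standard deviation, as reported per tool.
    print(f"SAQ overall: {mean(saq_ratings):.1f} ± {stdev(saq_ratings):.1f}")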
Edited by: Jacqueline G. Bloomfield, The University of Sydney, Australia
Reviewed by: Nataly Martini, The University of Auckland, New Zealand
Rebecca S. Koszalinski, University of Central Florida, United States
ISSN: 2296-858X
DOI: 10.3389/fmed.2025.1495378