Artificial intelligence in healthcare education: evaluating the accuracy of ChatGPT, Copilot, and Google Gemini in cardiovascular pharmacology
Published in: Frontiers in medicine, Vol. 12, p. 1495378
Main Authors: , , ,
Format: Journal Article
Language: English
Published: Frontiers Media S.A., Switzerland, 19.02.2025
Summary: Artificial intelligence (AI) is revolutionizing medical education; however, its limitations remain underexplored. This study evaluated the accuracy of three generative AI tools (ChatGPT-4, Copilot, and Google Gemini) in answering multiple-choice questions (MCQs) and short-answer questions (SAQs) on cardiovascular pharmacology, a key subject in healthcare education.

Using the free version of each AI tool, we administered 45 MCQs and 30 SAQs across three difficulty levels: easy, intermediate, and advanced. Three pharmacology experts reviewed the AI-generated answers. MCQ responses were recorded as correct or incorrect, while SAQ responses were rated on a 1-5 scale for relevance, completeness, and correctness (a toy scoring sketch follows this summary).

ChatGPT, Copilot, and Gemini all achieved high accuracy on easy and intermediate MCQs (87-100%). Although every model's performance declined on the advanced MCQ section, only Copilot (53% accuracy) and Gemini (20% accuracy) scored significantly lower than they had at the easy and intermediate levels. SAQ evaluations showed high accuracy for ChatGPT (overall 4.7 ± 0.3) and Copilot (overall 4.5 ± 0.4) across all difficulty levels, with no significant difference between the two tools. In contrast, Gemini's SAQ performance was markedly lower at every level (overall 3.3 ± 1.0).

ChatGPT-4 demonstrates the highest accuracy on both MCQ and SAQ cardiovascular pharmacology questions, regardless of difficulty level. Copilot ranks second, while Google Gemini shows significant limitations both in handling complex MCQs and in providing accurate SAQ responses in this field. These findings can guide the ongoing refinement of AI tools for specialized medical education.
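For readers who want to see how figures of this form are produced, the snippet below is a minimal sketch, not the authors' analysis code. It assumes an even split of 15 MCQs and 10 SAQs per difficulty level (the abstract gives only the totals of 45 and 30) and uses entirely made-up ratings. It computes per-level MCQ accuracy as a percentage and, for SAQs, averages the three expert ratings per answer before reporting a mean ± SD, the same form as the figures quoted in the summary.

```python
from statistics import mean, stdev

# Toy data illustrating the scoring scheme described in the summary.
# The 15-MCQ / 10-SAQ split per level is an assumption; the abstract
# reports only the totals (45 MCQs, 30 SAQs). All values are invented.

# MCQ answers marked correct (True) or incorrect (False).
mcq_results = {
    "easy":         [True] * 14 + [False] * 1,
    "intermediate": [True] * 13 + [False] * 2,
    "advanced":     [True] * 3  + [False] * 12,  # 20%, like Gemini's advanced score
}

# Each SAQ answer rated 1-5 by three experts (toy ratings for one level only).
saq_ratings = {
    "easy": [[5, 5, 4], [5, 4, 5], [4, 4, 4], [5, 5, 5], [3, 4, 4],
             [5, 5, 4], [4, 5, 5], [5, 4, 4], [5, 5, 5], [4, 4, 5]],
}

for level, answers in mcq_results.items():
    accuracy = 100 * sum(answers) / len(answers)   # booleans sum as 0/1
    print(f"MCQ {level}: {accuracy:.0f}% correct")

for level, ratings in saq_ratings.items():
    per_answer = [mean(r) for r in ratings]        # average the three expert ratings
    print(f"SAQ {level}: {mean(per_answer):.1f} ± {stdev(per_answer):.1f}")
```

Run as-is, this prints 20% for the advanced MCQ row and 4.5 ± 0.4 for the toy SAQ set, matching the format (though not the data) of the results reported above.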
Bibliography: Edited by: Jacqueline G. Bloomfield, The University of Sydney, Australia. Reviewed by: Nataly Martini, The University of Auckland, New Zealand; Rebecca S. Koszalinski, University of Central Florida, United States.
ISSN: 2296-858X
DOI: 10.3389/fmed.2025.1495378