Large Language Models in Biochemistry Education: Comparative Evaluation of Performance
Published in | JMIR Medical Education, Vol. 11, p. e67244
Format | Journal Article
Language | English
Published | Canada: JMIR Publications, 10.04.2025
ISSN | 2369-3762
DOI | 10.2196/67244
Summary | Recent advancements in artificial intelligence (AI), particularly in large language models (LLMs), have ushered in a new era of innovation across various fields, with medicine at the forefront of this technological revolution. Many studies have indicated that, at their current level of development, LLMs can pass various board examinations. However, their ability to answer subject-specific questions still requires validation.
The objective of this study was to conduct a comprehensive analysis comparing the performance of advanced LLM chatbots (Claude by Anthropic, GPT-4 by OpenAI, Gemini by Google, and Copilot by Microsoft) against the academic results of medical students in a medical biochemistry course.
We used 200 USMLE (United States Medical Licensing Examination)-style multiple-choice questions (MCQs) selected from the course exam database. The questions spanned various complexity levels and were distributed across 23 distinct topics; questions containing tables or images were excluded. In August 2024, the results of 5 successive attempts by Claude 3.5 Sonnet, GPT-4-1106, Gemini 1.5 Flash, and Copilot to answer this question set were evaluated for accuracy. Statistica 13.5.0.17 (TIBCO Software Inc) was used for basic statistical analysis of the data. Given the binary nature of the data, the chi-square test was used to compare results among the different chatbots, with a statistical significance level of P<.05.
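For readers who want to reproduce this style of comparison outside Statistica, the sketch below shows how a Pearson chi-square test on a 2x2 contingency table of correct/incorrect counts could be run in Python with scipy. The counts are taken from the per-chatbot accuracies reported in the Results; the scipy-based reanalysis is an illustrative assumption, not part of the study's workflow.

```python
# Illustrative sketch only: the study used Statistica, not Python/scipy.
# Pearson chi-square test comparing the proportion of correct answers
# between two chatbots on the same 200-question set.
from scipy.stats import chi2_contingency

# Rows: chatbot; columns: [correct, incorrect] counts (from the reported accuracies)
claude = [185, 15]    # 92.5% correct (185/200)
copilot = [128, 72]   # 64.0% correct (128/200)

chi2, p_value, dof, expected = chi2_contingency([claude, copilot])
print(f"chi-square = {chi2:.2f}, P = {p_value:.4f}, dof = {dof}")
# P < .05 would indicate a statistically significant difference in accuracy
# between the two chatbots on this question set.
```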
On average, the selected chatbots correctly answered 81.1% (SD 12.8%) of the questions, surpassing the students' performance by 8.3% (P=.02). Claude showed the best performance on the biochemistry MCQs, correctly answering 92.5% (185/200) of the questions, followed by GPT-4 (85%, 170/200), Gemini (78.5%, 157/200), and Copilot (64%, 128/200). The chatbots demonstrated the best results in the following 4 topics: eicosanoids (mean 100%, SD 0%), bioenergetics and electron transport chain (mean 96.4%, SD 7.2%), ketone bodies (mean 93.8%, SD 12.5%), and hexose monophosphate pathway (mean 91.7%, SD 16.7%). The Pearson chi-square test indicated a statistically significant association among the answers of all 4 chatbots (P<.001 to P<.04).
Our study suggests that different AI models may have unique strengths in specific medical fields, which could be leveraged for targeted support in biochemistry courses. These results highlight the potential of AI in medical education and assessment.