Evaluation of Large Language Model performance on the Multi-Specialty Recruitment Assessment (MSRA) exam

Bibliographic Details
Published in: Computers in Biology and Medicine, Vol. 168, p. 107794
Main Authors: Tsoutsanis, Panagiotis; Tsoutsanis, Aristotelis
Format: Journal Article
Language: English
Published: United States: Elsevier Ltd, 01.01.2024

Summary: AI-powered platforms have gained prominence in medical education and training, offering diverse applications from surgical performance assessment to exam preparation. This research paper examines the capabilities of Large Language Models (LLMs), including Llama 2, Google Bard, Bing Chat, and ChatGPT-3.5, in answering multiple-choice questions from the Clinical Problem Solving (CPS) paper of the Multi-Specialty Recruitment Assessment (MSRA) exam. Using a dataset of 100 CPS questions spanning ten subject categories, we assessed the LLMs' performance against medical doctors preparing for the exam. Results showed that Bing Chat outperformed all other LLMs and even surpassed human users of the Qbank question bank. Conversely, Llama 2's performance was inferior to that of human users. Google Bard and ChatGPT-3.5 did not differ significantly from human candidates in correct response rates. Pairwise comparisons demonstrated Bing Chat's significant superiority over Llama 2, Google Bard, and ChatGPT-3.5; however, no significant differences were found between Llama 2 and Google Bard, between Llama 2 and ChatGPT-3.5, or between Google Bard and ChatGPT-3.5. Freely available LLMs have already demonstrated that they can match, or even outperform, human users in answering MSRA exam questions, with Bing Chat emerging as a particularly strong performer. The study also highlights the potential for enhancing LLMs' medical knowledge acquisition through tailored fine-tuning: medical-knowledge-tailored LLMs such as Med-PaLM have already shown promising results. We provide valuable insights into LLMs' competence in answering medical MCQs and their potential integration into medical education and assessment processes.
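The pairwise comparisons of correct-response rates described in the summary can be illustrated with a standard two-proportion z-test. The abstract does not state which test the authors used, so this is only a minimal sketch of one common choice; the scores passed in below are made-up placeholders, not the study's actual results.

```python
import math

def two_proportion_z_test(correct_a, correct_b, n=100):
    """Two-sided two-proportion z-test for H0: equal correct-response rates.

    correct_a, correct_b: number of correct answers for each model/group
    n: number of questions answered by each (100 CPS questions here)
    Returns (z statistic, two-sided p-value).
    """
    p_a, p_b = correct_a / n, correct_b / n
    # pooled proportion under the null hypothesis of equal rates
    pooled = (correct_a + correct_b) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * (2 / n))
    z = (p_a - p_b) / se
    # two-sided p-value via the standard normal CDF (using math.erf)
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Placeholder example: 80/100 vs 60/100 correct answers
z, p = two_proportion_z_test(80, 60)
```

With these placeholder counts the difference comes out significant at the 5% level; with closer scores (e.g. 62 vs 60 correct) it does not, mirroring the pattern of significant and non-significant pairwise results reported above.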
• Large language models (LLMs) such as Bing Chat can outperform humans in answering MSRA exam questions.
• LLMs can achieve passing scores in various undergraduate and postgraduate medical examinations.
• AI tools such as LLMs could be utilized as medical-education tutor aids.
ISSN: 0010-4825, 1879-0534
DOI: 10.1016/j.compbiomed.2023.107794