Performance of artificial intelligence chatbots in sleep medicine certification board exams: ChatGPT versus Google Bard

Bibliographic Details
Published in European Archives of Oto-Rhino-Laryngology, Vol. 281, No. 4, pp. 2137–2143
Main Authors Cheong, Ryan Chin Taw, Pang, Kenny Peter, Unadkat, Samit, Mcneillis, Venkata, Williamson, Andrew, Joseph, Jonathan, Randhawa, Premjit, Andrews, Peter, Paleri, Vinidh
Format Journal Article
Language English
Published Berlin/Heidelberg: Springer Berlin Heidelberg, 01.04.2024

Summary:
Purpose: To conduct a comparative performance evaluation of GPT-3.5, GPT-4 and Google Bard on self-assessment questions at the level of the American Sleep Medicine Certification Board Exam.
Methods: A total of 301 text-based, single-best-answer multiple-choice questions with four answer options each, spanning 10 categories, were transcribed as inputs for GPT-3.5, GPT-4 and Google Bard. The first output response generated by each model was matched for accuracy against the gold-standard answer provided by the American Academy of Sleep Medicine for each question. A global score of 80% or above is required of human sleep medicine specialists to pass each exam category.
Results: GPT-4 achieved the pass mark of 80% or above in five of the 10 exam categories: the Normal Sleep and Variants Self-Assessment Exam (2021), the Circadian Rhythm Sleep–Wake Disorders Self-Assessment Exam (2021), the Insomnia Self-Assessment Exam (2022), the Parasomnias Self-Assessment Exam (2022) and the Sleep-Related Movements Self-Assessment Exam (2023). GPT-4 outperformed the other models in every exam category and achieved a higher overall score of 68.1% than both GPT-3.5 (46.8%) and Google Bard (45.5%), a statistically significant difference (p < 0.001). There was no significant difference in overall score between GPT-3.5 and Google Bard.
Conclusions: Otolaryngologists and sleep medicine physicians have a crucial role, through agile and robust research, in ensuring that next-generation AI chatbots are built safely and responsibly.
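The Methods reduce to exact-match scoring of each model's first response against the AASM answer key, followed by a significance test on overall accuracy. Below is a minimal Python sketch of that pipeline, reconstructed only from the aggregate figures quoted above: the score helper, the per-question response lists and the choice of a chi-squared test are illustrative assumptions, since the abstract reports p < 0.001 without naming the test used.

# Minimal sketch of the scoring pipeline described in the Methods.
# The response/gold lists and the chi-squared test are assumptions for
# illustration; only the totals and percentages come from the abstract.
from scipy.stats import chi2_contingency

TOTAL_QUESTIONS = 301

def score(responses: list[str], gold: list[str]) -> float:
    """Fraction of first-output answers matching the gold-standard key."""
    correct = sum(r == g for r, g in zip(responses, gold))
    return correct / len(gold)

# Overall scores from the abstract, converted back to correct-answer counts.
correct_counts = {
    "GPT-4": round(0.681 * TOTAL_QUESTIONS),        # 205
    "GPT-3.5": round(0.468 * TOTAL_QUESTIONS),      # 141
    "Google Bard": round(0.455 * TOTAL_QUESTIONS),  # 137
}

# Chi-squared test of independence on the 3x2 contingency table
# (model x correct/incorrect).
table = [[c, TOTAL_QUESTIONS - c] for c in correct_counts.values()]
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4g}")

With these counts the three-way comparison should come out far below p = 0.001, and a pairwise test restricted to GPT-3.5 (141 correct) and Google Bard (137 correct) should show no significant difference, consistent with the abstract.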
ISSN: 0937-4477
EISSN: 1434-4726
DOI: 10.1007/s00405-023-08381-3