Performance of artificial intelligence chatbots in sleep medicine certification board exams: ChatGPT versus Google Bard

Bibliographic Details
Published in European Archives of Oto-Rhino-Laryngology, Vol. 281, No. 4, pp. 2137–2143
Main Authors Cheong, Ryan Chin Taw, Pang, Kenny Peter, Unadkat, Samit, Mcneillis, Venkata, Williamson, Andrew, Joseph, Jonathan, Randhawa, Premjit, Andrews, Peter, Paleri, Vinidh
Format Journal Article
Language English
Published Berlin/Heidelberg: Springer Berlin Heidelberg, 01.04.2024

Summary:
Purpose: To conduct a comparative performance evaluation of GPT-3.5, GPT-4 and Google Bard on self-assessment questions at the level of the American Sleep Medicine Certification Board Exam.
Methods: A total of 301 text-based, single-best-answer multiple-choice questions with four answer options each, spanning 10 categories, were transcribed as inputs for GPT-3.5, GPT-4 and Google Bard. The first output response generated by each model was matched for accuracy against the gold-standard answer provided by the American Academy of Sleep Medicine for each question. A global score of 80% or above is required of human sleep medicine specialists to pass each exam category.
Results: GPT-4 achieved the pass mark of 80% or above in five of the 10 exam categories: the Normal Sleep and Variants Self-Assessment Exam (2021), the Circadian Rhythm Sleep–Wake Disorders Self-Assessment Exam (2021), the Insomnia Self-Assessment Exam (2022), the Parasomnias Self-Assessment Exam (2022) and the Sleep-Related Movements Self-Assessment Exam (2023). GPT-4 outperformed the other models in every exam category and achieved a higher overall score of 68.1% than both GPT-3.5 (46.8%) and Google Bard (45.5%), a statistically significant difference (p < 0.001). There was no significant difference in overall score between GPT-3.5 and Google Bard.
Conclusions: Otolaryngologists and sleep medicine physicians have a crucial role, through agile and robust research, in ensuring that next-generation AI chatbots are built safely and responsibly.
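The Methods reduce to exact-match scoring of each model's first response against the AASM answer key, followed by a significance test on overall accuracy. Below is a minimal Python sketch of that pipeline, reconstructed only from the aggregate figures quoted above: the score helper, the per-question response lists and the choice of a chi-squared test are illustrative assumptions, since the abstract reports p < 0.001 without naming the test used.

# Minimal sketch of the scoring pipeline described in the Methods.
# The response/gold lists and the chi-squared test are assumptions for
# illustration; only the totals and percentages come from the abstract.
from scipy.stats import chi2_contingency

TOTAL_QUESTIONS = 301

def score(responses: list[str], gold: list[str]) -> float:
    """Fraction of first-output answers matching the gold-standard key."""
    correct = sum(r == g for r, g in zip(responses, gold))
    return correct / len(gold)

# Overall scores from the abstract, converted back to correct-answer counts.
correct_counts = {
    "GPT-4": round(0.681 * TOTAL_QUESTIONS),        # 205
    "GPT-3.5": round(0.468 * TOTAL_QUESTIONS),      # 141
    "Google Bard": round(0.455 * TOTAL_QUESTIONS),  # 137
}

# Chi-squared test of independence on the 3x2 contingency table
# (model x correct/incorrect).
table = [[c, TOTAL_QUESTIONS - c] for c in correct_counts.values()]
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4g}")

With these counts the three-way comparison should come out far below p = 0.001, and a pairwise test restricted to GPT-3.5 (141 correct) and Google Bard (137 correct) should show no significant difference, consistent with the abstract.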
ISSN: 0937-4477
EISSN: 1434-4726
DOI: 10.1007/s00405-023-08381-3