B - 113 Assessing the Neuropsychology Information Base of Large Language Models
Published in: Archives of clinical neuropsychology
Format: Journal Article
Language: English
Published: 12.09.2024
Summary:
Objective: Research has demonstrated that Large Language Models (LLMs) can obtain passing scores on medical board-certification examinations and have made substantial improvements in recent years (e.g., ChatGPT-4 and ChatGPT-3.5 demonstrating an accuracy of 83.4% and 73.4%, respectively, on neurosurgical practice written board-certification questions). To date, the extent of LLMs' neuropsychology domain information has not been investigated. This study is an initial exploration of the performance of ChatGPT-3.5, ChatGPT-4, and Gemini on mock clinical neuropsychology written board-certification examination questions.
Methods: Six hundred practice examination questions were obtained from the BRAIN American Academy of Clinical Neuropsychology (AACN) website. Data for specific question domains and pediatric subclassification were available for 300 items. Using an a priori prompting strategy, the questions were input into ChatGPT-3.5, ChatGPT-4, and Gemini. Responses were scored against the BRAIN AACN answer keys. Chi-squared tests assessed the LLMs' performance overall and within domains, and significance was set at p = 0.002 using a Bonferroni correction.
Results: Across all six hundred items, ChatGPT-4 had superior accuracy (74%) to ChatGPT-3.5 (62.5%) and Gemini (52.7%; p's < 0.001). The LLMs had lower performance on items with domain information (ChatGPT-4 = 66%, ChatGPT-3.5 = 56%, Gemini = 42%; see Table 1). Generally, ChatGPT-4 performed better across domains (range = 59.5%–74%) than ChatGPT-3.5 (range = 48.4%–65.6%) and Gemini (range = 38.1%–50%). The same trend was observed for pediatric questions (ChatGPT-4 = 65%, ChatGPT-3.5 = 51.7%, Gemini = 46.7%).
Conclusions: Consistent with reports in other medical subspecialties, these findings reflect LLMs' rapidly expanding neuropsychology information base. With additional neuropsychology-specific training, LLMs may have utility in educational and clinical training settings. It is incumbent upon neuropsychologists to explore the various applications of this technology.
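The analysis described in the Methods is a standard pairwise chi-squared comparison of correct/incorrect counts with a Bonferroni-adjusted significance threshold. The sketch below is illustrative only: the counts are reconstructed from the reported overall percentages (600 items; 74%, 62.5%, 52.7%) and are approximations, not the authors' data, and a plain Pearson chi-squared without continuity correction is assumed rather than whatever exact variant the study used.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-squared (no continuity correction) for a 2x2 table
    [[a, b], [c, d]], plus the p-value for df = 1."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For df = 1, the chi-squared survival function reduces to erfc(sqrt(x/2)).
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

N = 600
# Correct-item counts reconstructed from the reported accuracies (approximate).
correct = {"ChatGPT-4": 444, "ChatGPT-3.5": 375, "Gemini": 316}

alpha = 0.002  # Bonferroni-corrected threshold reported in the abstract

pairs = [("ChatGPT-4", "ChatGPT-3.5"),
         ("ChatGPT-4", "Gemini"),
         ("ChatGPT-3.5", "Gemini")]
for a, b in pairs:
    chi2, p = chi2_2x2(correct[a], N - correct[a], correct[b], N - correct[b])
    verdict = "significant" if p < alpha else "n.s."
    print(f"{a} vs {b}: chi2 = {chi2:.1f}, p = {p:.2g} ({verdict} at alpha = {alpha})")
```

With these reconstructed counts, all three pairwise comparisons fall below the corrected threshold, consistent with the abstract's "p's < 0.001" for the overall accuracy differences.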
ISSN: 1873-5843
DOI: 10.1093/arclin/acae067.274