B - 113 Assessing the Neuropsychology Information Base of Large Language Models
Published in: Archives of clinical neuropsychology
Format: Journal Article
Language: English
Published: 12.09.2024
Summary:
Objective: Research has demonstrated that Large Language Models (LLMs) can obtain passing scores on medical board-certification examinations and have made substantial improvements in recent years (e.g., ChatGPT-4 and ChatGPT-3.5 demonstrating an accuracy of 83.4% and 73.4%, respectively, on neurosurgical practice written board-certification questions). To date, the extent of LLMs' neuropsychology domain information has not been investigated. This study is an initial exploration of the performance of ChatGPT-3.5, ChatGPT-4, and Gemini on mock clinical neuropsychology written board-certification examination questions.
Methods: Six hundred practice examination questions were obtained from the BRAIN American Academy of Clinical Neuropsychology (AACN) website. Data for specific question domains and pediatric subclassification were available for 300 items. Using an a priori prompting strategy, the questions were input into ChatGPT-3.5, ChatGPT-4, and Gemini. Responses were scored against the BRAIN AACN answer keys. Chi-squared tests assessed the LLMs' performance overall and within domains, and significance was set at p = 0.002 using a Bonferroni correction.
Results: Across all six hundred items, ChatGPT-4 had superior accuracy (74%) to ChatGPT-3.5 (62.5%) and Gemini (52.7%; p's < 0.001). The LLMs had lower performance on items with domain information (ChatGPT-4 = 66%, ChatGPT-3.5 = 56%, Gemini = 42%; see Table 1). Generally, ChatGPT-4 performed better across domains (range = 59.5%–74%) than ChatGPT-3.5 (range = 48.4%–65.6%) and Gemini (range = 38.1%–50%). The same trend was observed for pediatric questions (ChatGPT-4 = 65%, ChatGPT-3.5 = 51.7%, Gemini = 46.7%).
Conclusions: Consistent with reports in other medical subspecialties, these findings reflect LLMs' rapidly expanding neuropsychology information base. With additional neuropsychology-specific training, LLMs may have utility in educational and clinical training settings. It is incumbent upon neuropsychologists to explore the various applications of this technology.
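The analysis described in the Methods is a standard pairwise chi-squared comparison of correct/incorrect counts with a Bonferroni-adjusted significance threshold. The sketch below is illustrative only: the counts are reconstructed from the reported overall percentages (600 items; 74%, 62.5%, 52.7%) and are approximations, not the authors' data, and a plain Pearson chi-squared without continuity correction is assumed rather than whatever exact variant the study used.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-squared (no continuity correction) for a 2x2 table
    [[a, b], [c, d]], plus the p-value for df = 1."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For df = 1, the chi-squared survival function reduces to erfc(sqrt(x/2)).
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

N = 600
# Correct-item counts reconstructed from the reported accuracies (approximate).
correct = {"ChatGPT-4": 444, "ChatGPT-3.5": 375, "Gemini": 316}

alpha = 0.002  # Bonferroni-corrected threshold reported in the abstract

pairs = [("ChatGPT-4", "ChatGPT-3.5"),
         ("ChatGPT-4", "Gemini"),
         ("ChatGPT-3.5", "Gemini")]
for a, b in pairs:
    chi2, p = chi2_2x2(correct[a], N - correct[a], correct[b], N - correct[b])
    verdict = "significant" if p < alpha else "n.s."
    print(f"{a} vs {b}: chi2 = {chi2:.1f}, p = {p:.2g} ({verdict} at alpha = {alpha})")
```

With these reconstructed counts, all three pairwise comparisons fall below the corrected threshold, consistent with the abstract's "p's < 0.001" for the overall accuracy differences.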
ISSN: 1873-5843
DOI: 10.1093/arclin/acae067.274