Analyzing the performance of multimodal large language models on visually-based questions in the Japanese National Examination for Dental Technicians

Bibliographic Details
Published in	Journal of dental sciences
Main Authors	Mine, Yuichi; Taji, Tsuyoshi; Okazaki, Shota; Takeda, Saori; Peng, Tzu-Yu; Shimoe, Saiji; Kaku, Masato; Nikawa, Hiroki; Kakimoto, Naoya; Murayama, Takeshi
Format	Journal Article
Language	English
Published	Elsevier B.V., 2025
Subjects	Advanced Basic Science; ChatGPT-4o; Claude 3.5 Sonnet; Dental technician licensing examination; Dentistry; Image-based questions; Multimodal large language models; OpenAI o1
ISSN	1991-7902
DOI	10.1016/j.jds.2025.02.022

Summary:	Background/purpose: Large language models (LLMs) offer promising applications in dentistry, but their performance in specialized, image-rich contexts such as dental technology examinations remains uncertain. The purpose of this study was to evaluate the accuracy of three multimodal LLMs, ChatGPT-4o (4o), OpenAI o1 (o1), and Claude 3.5 Sonnet (Sonnet), when presented with questions from the Japanese National Examination for Dental Technicians. Materials and methods: A total of 240 multiple-choice questions from the 2022 to 2024 theory sections of the exam were used. Each question, including its accompanying figures or images where applicable, was presented to the three LLMs in a zero-shot manner without specialized prompting. Correct response rates were calculated overall, as well as by question type (text-only vs. visually-based) and subject area. Statistical comparisons were performed using Cochran's Q test, followed by McNemar's test with Bonferroni correction where indicated. Results: Overall correct response rates were 58.3 % (4o), 67.5 % (o1), and 64.6 % (Sonnet). For text-only questions, o1 achieved the highest accuracy (79.1 %), significantly outperforming 4o (68.3 %; P = 0.017). In contrast, all models showed reduced accuracy on visually-based questions (44.6–55.4 %), with no significant difference among them. Conclusion: These results suggest that multimodal LLMs can supplement theoretical dental technology education, although their limited performance on visual tasks indicates the need for traditional hands-on training. Enhanced image interpretation skills may help address workforce challenges in dental technology.
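
The statistical procedure named in the summary (Cochran's Q test across the three models, followed by pairwise McNemar tests with Bonferroni correction) can be sketched as below. This is a minimal illustration, not the authors' analysis: the record does not say which software or which McNemar variant was used, so the exact binomial form is an assumption, as is the 0.05 threshold. All function names are hypothetical, the upper-tail formula is valid only for three models (2 degrees of freedom), and the scores are synthetic placeholders rather than data from the study.

```python
from itertools import combinations
from math import comb, exp


def cochran_q(rows):
    """Cochran's Q statistic; rows holds one tuple of 0/1 scores per question."""
    k = len(rows[0])
    col_totals = [sum(r[j] for r in rows) for j in range(k)]
    row_totals = [sum(r) for r in rows]
    n = sum(row_totals)
    denom = k * n - sum(t * t for t in row_totals)
    if denom == 0:
        return 0.0  # every question was scored identically by all models
    return (k - 1) * (k * sum(c * c for c in col_totals) - n * n) / denom


def chi2_sf_df2(q):
    """Upper-tail chi-square probability for 2 degrees of freedom (three models)."""
    return exp(-q / 2)


def mcnemar_exact(a, b):
    """Two-sided exact (binomial) McNemar p-value for paired 0/1 score lists.

    Assumption: the exact variant is used here; the source does not specify one.
    """
    only_a = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    only_b = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    n = only_a + only_b
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    tail = sum(comb(n, i) for i in range(min(only_a, only_b) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def pairwise_bonferroni(scores):
    """Bonferroni-adjusted McNemar p-values for every pair of models."""
    pairs = list(combinations(scores, 2))
    return {
        (a, b): min(1.0, mcnemar_exact(scores[a], scores[b]) * len(pairs))
        for a, b in pairs
    }


def analyze(scores, alpha=0.05):
    """Omnibus test first; pairwise tests only if the omnibus result is significant."""
    rows = list(zip(*scores.values()))
    q = cochran_q(rows)
    p = chi2_sf_df2(q)
    post_hoc = pairwise_bonferroni(scores) if p < alpha else {}
    return q, p, post_hoc


# Synthetic scores for ten questions (1 = correct, 0 = incorrect); NOT study data.
toy_scores = {
    "model_a": [1, 1, 0, 0, 1, 0, 1, 0, 0, 1],
    "model_b": [1, 1, 1, 1, 1, 0, 1, 1, 0, 1],
    "model_c": [1, 0, 1, 0, 1, 0, 1, 1, 0, 1],
}

q_stat, p_value, post_hoc = analyze(toy_scores)
print(f"Q = {q_stat:.3f}, p = {p_value:.3f}, post hoc = {post_hoc}")
```

Running the pairwise step only after a significant omnibus test mirrors the "where indicated" wording of the summary; multiplying each pairwise p-value by the number of comparisons is the Bonferroni adjustment.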