Analyzing the performance of multimodal large language models on visually-based questions in the Japanese National Examination for Dental Technicians


Bibliographic Details
Published in Journal of dental sciences
Main Authors Mine, Yuichi, Taji, Tsuyoshi, Okazaki, Shota, Takeda, Saori, Peng, Tzu-Yu, Shimoe, Saiji, Kaku, Masato, Nikawa, Hiroki, Kakimoto, Naoya, Murayama, Takeshi
Format Journal Article
Language English
Published Elsevier B.V. 2025
ISSN 1991-7902
DOI 10.1016/j.jds.2025.02.022

Summary:
Background/purpose: Large language models (LLMs) offer promising applications in dentistry, but their performance in specialized, image-rich contexts such as dental technology examinations remains uncertain. The purpose of this study was to evaluate the accuracy of three multimodal LLMs, ChatGPT-4o (4o), OpenAI o1 (o1), and Claude 3.5 Sonnet (Sonnet), when presented with questions from the Japanese National Examination for Dental Technicians.
Materials and methods: A total of 240 multiple-choice questions from the theory sections of the 2022 to 2024 exams were used. Each question, including its accompanying figures or images where applicable, was presented to the three LLMs in a zero-shot manner without specialized prompting. Correct response rates were calculated overall, as well as by question type (text-only vs. visually-based) and subject area. Statistical comparisons were performed using Cochran's Q test, followed by McNemar's test with Bonferroni correction where indicated.
Results: Overall correct response rates were 58.3 % (4o), 67.5 % (o1), and 64.6 % (Sonnet). For text-only questions, o1 achieved the highest accuracy (79.1 %), significantly outperforming 4o (68.3 %; P = 0.017). In contrast, all models showed reduced accuracy on visually-based questions (44.6–55.4 %), with no significant difference among them.
Conclusion: These results suggest that multimodal LLMs can supplement theoretical dental technology education, although their limited performance on visual tasks indicates the need for traditional hands-on training. Enhanced image interpretation skills may help address workforce challenges in dental technology.
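Illustrative note: the statistical procedure named in the methods (Cochran's Q test across the three models, followed by pairwise McNemar's tests with Bonferroni correction) can be sketched in plain Python as below. This is a minimal sketch under stated assumptions, not the authors' analysis code: the record does not say whether the exact or the chi-square form of McNemar's test was used (the exact binomial form is assumed here), the function names are hypothetical, and `TOY` is invented placeholder data, not results from the study.

```python
import math
from itertools import combinations

def cochran_q(rows):
    """Cochran's Q for binary outcomes.

    rows: one tuple per question, one 0/1 entry per model (1 = correct).
    Returns (Q, degrees_of_freedom).
    """
    k = len(rows[0])
    col_totals = [sum(r[j] for r in rows) for j in range(k)]
    row_totals = [sum(r) for r in rows]
    n = sum(row_totals)
    denom = k * n - sum(t * t for t in row_totals)
    if denom == 0:
        raise ValueError("Q is undefined: no question separates the models")
    q = (k - 1) * (k * sum(c * c for c in col_totals) - n * n) / denom
    return q, k - 1

def chi2_sf_df2(x):
    """Chi-square upper-tail probability; this closed form is valid for
    2 degrees of freedom only, which is the case for three models."""
    return math.exp(-x / 2)

def mcnemar_exact(b, c):
    """Two-sided exact (binomial) McNemar p-value from the discordant counts:
    b = first model correct and second wrong, c = the reverse."""
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(math.comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def pairwise_mcnemar(rows, names):
    """Bonferroni-adjusted exact McNemar p-values for every pair of models."""
    pairs = list(combinations(range(len(names)), 2))
    adjusted = {}
    for i, j in pairs:
        b = sum(1 for r in rows if r[i] == 1 and r[j] == 0)
        c = sum(1 for r in rows if r[i] == 0 and r[j] == 1)
        adjusted[(names[i], names[j])] = min(1.0, mcnemar_exact(b, c) * len(pairs))
    return adjusted

# Hypothetical placeholder data (NOT from the study): ten questions,
# columns are (4o, o1, Sonnet), 1 = answered correctly.
TOY = [(1, 1, 1), (1, 1, 1), (0, 1, 1), (0, 1, 1), (0, 1, 0),
       (0, 1, 1), (0, 0, 0), (1, 1, 1), (0, 1, 0), (0, 0, 1)]

q_stat, dof = cochran_q(TOY)
p_overall = chi2_sf_df2(q_stat)
posthoc = pairwise_mcnemar(TOY, ["4o", "o1", "Sonnet"])
```

Because the same questions are put to every model, the outcomes are paired, which is why these tests compare models through the questions on which they disagree rather than through independent proportions. The pairwise step would normally be run only when the overall Q test is significant, matching the "where indicated" wording in the methods.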