Comparative Analysis of the Response Accuracies of Large Language Models in the Korean National Dental Hygienist Examination Across Korean and English Questions
Published in: International Journal of Dental Hygiene, Vol. 23, No. 2, pp. 267-276
Main Authors: ,
Format: Journal Article
Language: English
Published: England: Blackwell Publishing Ltd; John Wiley and Sons Inc, 01.05.2025
ABSTRACT
Introduction
Large language models such as Gemini, GPT‐3.5, and GPT‐4 have demonstrated significant potential in the medical field. Their performance in medical licensing examinations globally has highlighted their capabilities in understanding and processing specialized medical knowledge. This study aimed to evaluate and compare the performance of Gemini, GPT‐3.5, and GPT‐4 in the Korean National Dental Hygienist Examination. The accuracy of answering the examination questions in both Korean and English was assessed.
Methods
This study used a dataset comprising questions from the Korean National Dental Hygienist Examination over 5 years (2019–2023). A two‐way analysis of variance (ANOVA) test was employed to investigate the impacts of model type and language on the accuracy of the responses. Questions were input into each model under standardized conditions, and responses were classified as correct or incorrect based on predefined criteria.
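The Methods describe a two-way ANOVA with model type and question language as the factors and response accuracy as the outcome. The sketch below shows how such an analysis could be set up in Python with pandas and statsmodels; the factor levels mirror the study design (three models, two languages, five exam years), but the accuracy values, data layout, and variable names are hypothetical placeholders, not the authors' actual data or code.

```python
# Minimal sketch of a two-way ANOVA (model type x question language),
# assuming one accuracy figure per model, language, and exam year.
# All numeric values below are hypothetical placeholders, not the study's data.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(0)
rows = []
for year in range(2019, 2024):                      # five exam years (2019-2023)
    for llm in ["Gemini", "GPT-3.5", "GPT-4"]:      # factor 1: model type
        for lang in ["Korean", "English"]:          # factor 2: question language
            base = {"Gemini": 55.0, "GPT-3.5": 60.0, "GPT-4": 75.0}[llm]
            bonus = 5.0 if lang == "English" else 0.0
            rows.append({"year": year, "model": llm, "language": lang,
                         "accuracy": base + bonus + rng.normal(0, 3)})
df = pd.DataFrame(rows)

# Fit main effects plus their interaction, then print the type-II ANOVA table.
fit = ols("accuracy ~ C(model) * C(language)", data=df).fit()
print(sm.stats.anova_lm(fit, typ=2))
```

In the study itself, accuracy came from responses classified as correct or incorrect against predefined criteria; the synthetic numbers above merely illustrate the model-by-language factorial structure being tested.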
Results
GPT‐4 consistently outperformed the other models, achieving the highest accuracy rates across both language versions annually. In particular, it showed superior performance in English, suggesting advancements in its training algorithms for language processing. However, all models demonstrated variable accuracies in subjects with localized characteristics, such as health and medical law.
Conclusions
These findings indicate that GPT‐4 holds significant promise for application in medical education and standardized testing, especially in English. However, the variability in performance across different subjects and languages underscores the need for ongoing improvements and the inclusion of more diverse and localized training datasets to enhance the models' effectiveness in multilingual and multicultural contexts.
Funding: This work was supported by the Korea Medical Device Development Fund grant funded by the Korean Government (the Ministry of Science and ICT; the Ministry of Trade, Industry and Energy; the Ministry of Health & Welfare; the Ministry of Food and Drug Safety) (Project Number: 1711196792, RS‐2023‐00253380).
ISSN: 1601-5029, 1601-5037
DOI: 10.1111/idh.12848