Performance of Generative Pre-trained Transformer (GPT)-4 and Gemini Advanced on the First-Class Radiation Protection Supervisor Examination in Japan

Bibliographic Details
Published in: Cureus (Palo Alto, CA), Vol. 16, No. 10, p. e70614
Main Authors: Goto, Hiroki; Shiraishi, Yoshioki; Okada, Seiji
Format: Journal Article
Language: English
Published: United States: Cureus, 01.10.2024
Summary: Purpose The purpose of this study was to evaluate the capabilities of large language models (LLMs) in understanding radiation safety and protection. We assessed the performance of generative pre-trained transformer (GPT)-4 (OpenAI, USA) and Gemini Advanced (Google DeepMind, London) using questions from the First-Class Radiation Protection Supervisor Examination in Japan. Methods GPT-4 and Gemini Advanced answered questions from the 68th First-Class Radiation Protection Supervisor Examination in Japan. The numbers of correct and incorrect answers were analyzed by subject, presence or absence of calculation, passage length, and question format (textual or graphical). The results of GPT-4 and Gemini Advanced were then compared. Results The overall accuracy rates of GPT-4 and Gemini Advanced were 71.0% and 65.3%, respectively. A significant difference was observed across subjects (P < 0.0001 for GPT-4 and P = 0.0127 for Gemini Advanced): the accuracy rate for laws and regulations was lower than for the other subjects. There was no significant difference by presence or absence of calculation or by passage length. Both LLMs performed significantly better on textual questions than on graphical questions (P = 0.0003 for GPT-4 and P < 0.0001 for Gemini Advanced). The performance of the two LLMs did not differ significantly by subject, presence or absence of calculation, passage length, or format. Conclusions GPT-4 and Gemini Advanced demonstrated sufficient understanding of physics, chemistry, biology, and practical operations to meet the passing standard for the average score. However, their performance on laws and regulations was insufficient, possibly because of frequent revisions and the complexity of detailed regulations, and further machine learning is required.
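The abstract does not state which statistical test produced the per-subject P values; the minimal sketch below assumes a chi-square test of independence on per-subject correct/incorrect counts, which is one common way to run the kind of subject-level comparison described above. All subject labels and counts are hypothetical and are not data from the paper.

# Minimal sketch (not from the paper): per-subject accuracy comparison for one model,
# assuming a chi-square test of independence; counts below are hypothetical.
from scipy.stats import chi2_contingency

counts = {
    "physics":              (20, 5),   # (correct, incorrect)
    "chemistry":            (19, 6),
    "biology":              (21, 4),
    "practical operations": (18, 7),
    "laws and regulations": (10, 15),
}

# Rows = subjects, columns = [correct, incorrect]
table = [[c, w] for c, w in counts.values()]
chi2, p, dof, _ = chi2_contingency(table)

for subject, (c, w) in counts.items():
    print(f"{subject:22s} accuracy = {c / (c + w):.1%}")
print(f"chi-square = {chi2:.2f}, dof = {dof}, P = {p:.4f}")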
ISSN: 2168-8184
DOI: 10.7759/cureus.70614