Comparative Evaluation of Multiplatform AI Performance on Practical Ophthalmology Exam Questions: Insights from the Brazilian Council of Ophthalmology Exam

Bibliographic Details
Published in: Journal of Advances in Medicine and Medical Research, Vol. 37, No. 8, pp. 159-170
Main Authors: Nunes, Déborah Silva; David, Joacy Pedro Franco; Filho, José Jesu Sisnando D'Araujo; Nascimento, Kelly Cristina Costa Guedes; Coutinho, Igor Jordan Barbosa; Ferraz, Rebeca Andrade; Zemero, Maria Isabel Muniz; Fayal, Syenne Pimentel; Passos, Ana Caroline Coelho dos; Barros, Luis Eduardo de Carvalho; Virgolino, Rodrigo Rodrigues; Marques, George de Almeida; Lima, Vitor Hugo Auzier
Format: Journal Article
Language: English
Published: 19.08.2025

Summary: In recent years, advances in artificial intelligence (AI), especially with the emergence of natural language models and deep neural networks, have revolutionised medical practice, offering tools with the potential to assist both in diagnosis and in specialised medical training. The main objective of this study was to evaluate the accuracy and agreement of different AI models in solving practical questions from the Brazilian Council of Ophthalmology (CBO) Exam. To this end, the performance of five AI models (ChatGPT, Gemini, DeepSeek, Google AI Studio, and GROK) was analysed on a set of 560 questions distributed across eight thematic blocks of ophthalmology (Cornea, Cataract, Retina, Glaucoma, Neuro-ophthalmology, Optics and Refraction, Strabismus, and Plastic Surgery/Lacrimal Duct/Orbit). The responses were compared with the official answer key by calculating the percentage of correct answers and the Cohen's and Fleiss's Kappa coefficients of agreement: Cohen's Kappa measured the agreement between each AI's responses and the official answer key, while Fleiss's Kappa measured the overall agreement among the different AIs. The most evident finding was that the Gemini model achieved the highest accuracy rate (77.6%) and the highest overall agreement with the official answer key. Significant variation in performance between blocks was also observed, with greater accuracy in the Retina and Glaucoma themes and lower accuracy in the Strabismus and Plastic Surgery blocks. The thematic analysis identified the pattern of correct answers by speciality, revealing weaknesses of the models in areas that depend more heavily on visual assessment and clinical subjectivity. Beyond their likely educational applicability, the AIs proved viable as a complementary tool in medical training, especially when used under supervision and with defined pedagogical objectives. It was therefore concluded that, despite their limitations, the most up-to-date models trained on specific clinical data were able to reproduce diagnostic reasoning faithfully in several areas of ophthalmology, demonstrating their potential for integration into specialised education, provided they are used under technical and ethical criteria. These findings suggest that AI can serve as a supplementary tool in ophthalmic education, with caution warranted in more subjective specialities.
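
Note: the summary names two agreement statistics, Cohen's Kappa (each AI against the official answer key) and Fleiss's Kappa (overall agreement among the AIs). The short Python sketch below shows one self-contained way such statistics can be computed for multiple-choice answers. It is illustrative only, not the authors' analysis code, and the answer labels in the example are hypothetical.

# Minimal sketch (not the study's code) of the agreement statistics named in the summary.
# The example answer labels are made up; the study used 560 CBO exam questions and five AI models.
import numpy as np

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa between one rater's answers and a reference (e.g. the official answer key)."""
    a, b = np.asarray(rater_a), np.asarray(rater_b)
    categories = np.union1d(a, b)
    # Observed agreement: fraction of questions answered identically.
    p_o = np.mean(a == b)
    # Chance agreement from each rater's marginal answer distribution.
    p_e = sum(np.mean(a == c) * np.mean(b == c) for c in categories)
    return (p_o - p_e) / (1.0 - p_e)

def fleiss_kappa(labels):
    """Fleiss's kappa for several raters (AIs) answering the same questions.

    labels: array of shape (n_questions, n_raters) with categorical answers.
    """
    labels = np.asarray(labels)
    n_items, n_raters = labels.shape
    categories = np.unique(labels)
    # counts[i, j] = number of raters choosing category j on question i.
    counts = np.stack([(labels == c).sum(axis=1) for c in categories], axis=1)
    p_j = counts.sum(axis=0) / (n_items * n_raters)              # category proportions
    P_i = (np.sum(counts ** 2, axis=1) - n_raters) / (n_raters * (n_raters - 1))
    P_bar, P_e = P_i.mean(), np.sum(p_j ** 2)
    return (P_bar - P_e) / (1.0 - P_e)

# Hypothetical multiple-choice answers (A-D) for six questions.
key    = ["A", "C", "B", "D", "A", "B"]                          # official answer key
model1 = ["A", "C", "B", "D", "B", "B"]                          # e.g. one AI model
model2 = ["A", "C", "D", "D", "A", "C"]

print("Cohen's kappa vs key:", round(cohen_kappa(model1, key), 3))
print("Fleiss's kappa across raters:",
      round(fleiss_kappa(np.column_stack([key, model1, model2])), 3))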
ISSN: 2456-8899
DOI: 10.9734/jammr/2025/v37i85913