Comparative Evaluation of Multiplatform AI Performance on Practical Ophthalmology Exam Questions: Insights from the Brazilian Council of Ophthalmology Exam
Published in | Journal of Advances in Medicine and Medical Research Vol. 37; no. 8; pp. 159 - 170 |
---|---|
Format | Journal Article |
Language | English |
Published | 19.08.2025 |
Summary: | In recent years, advances in artificial intelligence (AI), especially the emergence of natural language models and deep neural networks, have revolutionised medical practice, offering tools with the potential to assist both in diagnosis and in specialised medical training. The main objective of this study was to evaluate the accuracy and agreement of different AI models in solving practical questions from the Brazilian Council of Ophthalmology (CBO) Exam. To this end, the performance of five AI models (ChatGPT, Gemini, DeepSeek, Google AI Studio, and GROK) was analysed on a set of 560 questions distributed across eight thematic blocks of ophthalmology (Cornea, Cataract, Retina, Glaucoma, Neuro-ophthalmology, Optics and Refraction, Strabismus, and Plastic Surgery/Lacrimal Duct/Orbit). The answers were compared with the official answer key by calculating the percentage of correct responses, Cohen's kappa to measure agreement between each model's answers and the official key, and Fleiss' kappa to measure overall agreement among the different AIs (both illustrated in the sketch below). The most evident finding was that Gemini achieved the highest accuracy (77.6%) and the highest overall agreement with the official answer key. Performance also varied significantly between blocks, with greater accuracy in the Retina and Glaucoma themes and lower accuracy in the Strabismus and Plastic Surgery blocks. The thematic analysis identified the pattern of correct answers by speciality, revealing weaknesses of the models in areas that depend more heavily on visual assessment and clinical subjectivity. Beyond this probable educational applicability, AI proved viable as a complementary tool in medical training, especially when used under supervision and with defined pedagogical objectives. It was therefore concluded that, despite their limitations, the most up-to-date models trained on specific clinical data were able to reproduce diagnostic reasoning faithfully in several areas of ophthalmology, evidencing their potential for integration into specialised education, provided they are used with technical and ethical criteria. These findings suggest that AI can serve as a supplementary tool in ophthalmic education, with caution in the more subjective specialities. |
---|---|
ISSN: | 2456-8899 |
DOI: | 10.9734/jammr/2025/v37i85913 |
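The abstract reports question-level accuracy, Cohen's kappa (each AI model against the official answer key), and Fleiss' kappa (overall agreement among the models). The Python sketch below is a minimal illustration of how such statistics can be computed from multiple-choice answers; it is not the authors' code, and the model names and toy answer data are hypothetical placeholders rather than the study's actual responses.

```python
# Illustrative sketch only: Cohen's kappa and Fleiss' kappa for
# multiple-choice agreement. All answer data below are made up.
import numpy as np


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa between two label sequences (e.g. one AI model vs. the official key)."""
    a, b = np.asarray(rater_a), np.asarray(rater_b)
    categories = np.union1d(a, b)
    # Observed agreement: fraction of items where both raters give the same label.
    p_o = np.mean(a == b)
    # Expected (chance) agreement from each rater's marginal label frequencies.
    p_e = sum(np.mean(a == c) * np.mean(b == c) for c in categories)
    return (p_o - p_e) / (1.0 - p_e)


def fleiss_kappa(labels_by_rater):
    """Fleiss' kappa for several raters (e.g. several AI models) labelling the same items."""
    labels = np.asarray(labels_by_rater)           # shape: (n_raters, n_items)
    n_raters, n_items = labels.shape
    categories = np.unique(labels)
    # counts[i, j] = number of raters assigning item i to category j.
    counts = np.array([[np.sum(labels[:, i] == c) for c in categories]
                       for i in range(n_items)], dtype=float)
    # Per-item agreement, then compare its mean with chance agreement.
    p_i = (np.sum(counts ** 2, axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_j = counts.sum(axis=0) / (n_items * n_raters)
    p_bar, p_e = p_i.mean(), np.sum(p_j ** 2)
    return (p_bar - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    # Hypothetical answers (A-D) to 10 questions; not data from the study.
    official_key = list("ABCDABCDAB")
    model_answers = {
        "Model 1": list("ABCDABCDBB"),
        "Model 2": list("ABCDABCDAB"),
        "Model 3": list("ABCDABDDAB"),
    }
    for name, answers in model_answers.items():
        acc = np.mean(np.array(answers) == np.array(official_key))
        print(f"{name}: accuracy={acc:.2f}, "
              f"Cohen's kappa vs key={cohen_kappa(answers, official_key):.2f}")
    print(f"Fleiss' kappa across models={fleiss_kappa(list(model_answers.values())):.2f}")
```

In this framing, accuracy scores each model against the key, Cohen's kappa corrects that pairwise agreement for chance, and Fleiss' kappa summarises how consistently the models answer relative to one another, which matches the role these statistics play in the abstract.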