ChatGPT-4 Consistency in Interpreting Laryngeal Clinical Images of Common Lesions and Disorders
Published in: Otolaryngology-Head and Neck Surgery, Vol. 171, No. 4, p. 1106
Main Authors: , ,
Format: Journal Article
Language: English
Published: England, 01.10.2024
Summary:

Objective: To investigate the consistency of Chat Generative Pre-trained Transformer (ChatGPT)-4 in the analysis of clinical pictures of common laryngological conditions.
Study Design: Prospective uncontrolled study.
Setting: Multicenter study.
Methods: Patient history and clinical videolaryngostroboscopic images were presented to ChatGPT-4 for differential diagnoses, management, and treatment(s). ChatGPT-4 responses were assessed by 3 blinded laryngologists using the Artificial Intelligence Performance Instrument (AIPI). The complexity of cases and the consistency between practitioners and ChatGPT-4 in interpreting clinical images were rated on a 5-point Likert scale. The intraclass correlation coefficient (ICC) was used to measure the strength of interrater agreement.
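For readers unfamiliar with the statistic, the sketch below shows one common way to compute an intraclass correlation, ICC(2,1) (two-way random effects, absolute agreement, single rater; Shrout and Fleiss), in plain NumPy. The function name `icc_2_1` and the simulated 40 × 3 Likert ratings are illustrative assumptions, not the study's code or data; the abstract does not specify which ICC form was used.

```python
import numpy as np

def icc_2_1(ratings: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: (n_targets, k_raters) array, e.g. 40 cases x 3 judges.
    """
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)   # per-case means
    col_means = ratings.mean(axis=0)   # per-rater means

    # Two-way ANOVA decomposition of the total sum of squares.
    ss_rows = k * ((row_means - grand) ** 2).sum()
    ss_cols = n * ((col_means - grand) ** 2).sum()
    ss_total = ((ratings - grand) ** 2).sum()
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # Shrout & Fleiss ICC(2,1) formula.
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

# Hypothetical data: a latent per-case score on a 1-5 Likert scale,
# plus small per-rater disagreement, clipped back to the scale.
rng = np.random.default_rng(0)
true = rng.integers(1, 6, size=(40, 1))
noise = rng.integers(-1, 2, size=(40, 3))
scores = np.clip(true + noise, 1, 5)
print(f"ICC(2,1) = {icc_2_1(scores):.3f}")  # high agreement by construction
```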
Results: Forty patients with a mean complexity score of 2.60 ± 1.15 were included. The mean consistency score for ChatGPT-4 image interpretation was 2.46 ± 1.42. ChatGPT-4 analyzed the clinical images perfectly in 6 cases (15%; score 5/5), while consistency between ChatGPT-4 and the judges was high in 5 cases (12.5%; score 4/5). Judges reported an ICC of 0.965 for the consistency score (P = .001). ChatGPT-4 erroneously documented vocal fold irregularity (mass or lesion), glottic insufficiency, and vocal cord paralysis in 21 (52.5%), 2 (5.0%), and 5 (12.5%) cases, respectively. ChatGPT-4 and practitioners indicated 153 and 63 additional examinations, respectively (P = .001). The ChatGPT-4 primary diagnosis was correct in 20.0% to 25.0% of cases. The clinical image consistency score was significantly associated with the AIPI score (r = 0.830; P = .001).
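The reported association (r = 0.830) between the per-case image-consistency score and the AIPI score reads as a Pearson-style correlation, though the abstract does not state the correlation type. A minimal sketch of computing such a coefficient with SciPy's `pearsonr` follows, using simulated stand-in arrays rather than the study data.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical paired scores for 40 cases: 1-5 Likert image-consistency
# ratings and AIPI totals, correlated by construction for illustration.
rng = np.random.default_rng(1)
consistency = rng.integers(1, 6, size=40).astype(float)
aipi = 2.0 * consistency + rng.normal(0.0, 1.0, size=40)

r, p = pearsonr(consistency, aipi)
print(f"r = {r:.3f}, P = {p:.3g}")
```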
Conclusion: ChatGPT-4 performs better at primary diagnosis than at image analysis or at selecting the most appropriate additional examinations and treatments.
ISSN: 1097-6817
DOI: 10.1002/ohn.897