Validity and reliability of an instrument evaluating the performance of intelligent chatbot: the Artificial Intelligence Performance Instrument (AIPI)

Bibliographic Details
Published in	European archives of oto-rhino-laryngology Vol. 281; no. 4; pp. 2063 - 2079
Main Authors	Lechien, Jerome R.; Maniaci, Antonino; Gengler, Isabelle; Hans, Stephane; Chiesa-Estomba, Carlos M.; Vaira, Luigi A.
Format	Journal Article
Language	English
Published	Berlin/Heidelberg: Springer Berlin Heidelberg (Springer Verlag), 01.04.2024
Subjects	Artificial Intelligence; Chatbot; ChatGPT; Comparison; Diagnosis; Female; GPT; Head and Neck Surgery; Head neck; Humans; Instrument; Laryngology; Life Sciences; Medicine; Medicine & Public Health; Neurosurgery; Otolaryngology; Otorhinolaryngology; Performance; Psychometrics; Reproducibility of Results; Surgery; Surveys and Questionnaires; Tool; Treatment
Summary:	Objectives: To evaluate the reliability and validity of the Artificial Intelligence Performance Instrument (AIPI). Methods: Medical records of patients consulting in otolaryngology were evaluated by physicians and ChatGPT for differential diagnosis, management, and treatment. The ChatGPT performance was rated twice using AIPI within a 7-day period to assess test–retest reliability. Internal consistency was evaluated using Cronbach’s α. Internal validity was evaluated by comparing the AIPI scores of the clinical cases rated by ChatGPT and 2 blinded practitioners. Convergent validity was measured by comparing the AIPI score with a modified version of the Ottawa Clinical Assessment Tool (OCAT). Interrater reliability was assessed using Kendall’s tau. Results: Forty-five patients completed the evaluations (28 females). The AIPI Cronbach’s alpha analysis suggested an adequate internal consistency (α = 0.754). The test–retest reliability was moderate-to-strong for items and the total score of AIPI (r_s = 0.486, p = 0.001). The mean AIPI score of the senior otolaryngologist was significantly higher compared to the score of ChatGPT, supporting adequate internal validity (p = 0.001). Convergent validity reported a moderate and significant correlation between AIPI and modified OCAT (r_s = 0.319; p = 0.044). The interrater reliability reported significant positive concordance between both otolaryngologists for the patient feature, diagnostic, additional examination, and treatment subscores as well as for the AIPI total score. Conclusions: AIPI is a valid and reliable instrument in assessing the performance of ChatGPT in ear, nose and throat conditions. Future studies are needed to investigate the usefulness of AIPI in medicine and surgery, and to evaluate the psychometric properties in these fields.
ISSN:	0937-4477 1434-4726
DOI:	10.1007/s00405-023-08219-y
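
Note on the statistics named in the summary: the record does not include item-level data or the authors' analysis code, so the following is only a minimal sketch of how two of the reported quantities are conventionally computed. It covers Cronbach's α for internal consistency and a Spearman rank correlation of the kind used for test–retest reliability. The function names and the example ratings are hypothetical and are not taken from the study.

```python
from statistics import mean, variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of rows (one row per rated case,
    one column per instrument item). Uses sample variances."""
    k = len(scores[0])  # number of items
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = avg
        i = j + 1
    return ranks


def spearman_rs(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return cov / (sxx * syy) ** 0.5


if __name__ == "__main__":
    # Hypothetical ratings: 5 cases x 3 items (NOT data from the study).
    ratings = [[3, 4, 3], [2, 2, 3], [4, 4, 5], [1, 2, 2], [3, 3, 4]]
    first_total = [sum(row) for row in ratings]
    second_total = [10, 8, 12, 5, 11]  # hypothetical second rating session
    print("alpha:", round(cronbach_alpha(ratings), 3))
    print("r_s:", round(spearman_rs(first_total, second_total), 3))
```

The sketch omits p-values and Kendall's tau. In practice a statistics package would typically be used for those, and nothing here indicates which software the authors used.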