Validity and reliability of an instrument evaluating the performance of intelligent chatbot: the Artificial Intelligence Performance Instrument (AIPI)
Published in | European archives of oto-rhino-laryngology Vol. 281; no. 4; pp. 2063 - 2079 |
---|---|
Main Authors | , , , , , |
Format | Journal Article |
Language | English |
Published | Berlin/Heidelberg; Springer Berlin Heidelberg; 01.04.2024; Springer Verlag |
Summary: | Objectives
To evaluate the reliability and validity of the Artificial Intelligence Performance Instrument (AIPI).
Methods
Medical records of patients consulting in otolaryngology were evaluated by physicians and ChatGPT for differential diagnosis, management, and treatment. The ChatGPT performance was rated twice using AIPI within a 7-day period to assess test–retest reliability. Internal consistency was evaluated using Cronbach’s α. Internal validity was evaluated by comparing the AIPI scores of the clinical cases rated by ChatGPT and 2 blinded practitioners. Convergent validity was measured by comparing the AIPI score with a modified version of the Ottawa Clinical Assessment Tool (OCAT). Interrater reliability was assessed using Kendall’s tau.
Results
Forty-five patients completed the evaluations (28 females). The AIPI Cronbach’s alpha analysis suggested an adequate internal consistency (α = 0.754). The test–retest reliability was moderate-to-strong for items and the total score of AIPI (r_s = 0.486, p = 0.001). The mean AIPI score of the senior otolaryngologist was significantly higher compared to the score of ChatGPT, supporting adequate internal validity (p = 0.001). Convergent validity reported a moderate and significant correlation between AIPI and modified OCAT (r_s = 0.319; p = 0.044). The interrater reliability reported significant positive concordance between both otolaryngologists for the patient feature, diagnostic, additional examination, and treatment subscores as well as for the AIPI total score.
Conclusions
AIPI is a valid and reliable instrument in assessing the performance of ChatGPT in ear, nose and throat conditions. Future studies are needed to investigate the usefulness of AIPI in medicine and surgery, and to evaluate the psychometric properties in these fields. |
---|---|
ISSN: | 0937-4477, 1434-4726 |
DOI: | 10.1007/s00405-023-08219-y |
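
Note on the statistics named in the summary: the abstract reports Cronbach's α for internal consistency and rank correlations (r_s) for test–retest reliability and convergent validity. The sketch below shows how such figures are commonly computed. It is not the authors' analysis code: the function names, the three-item layout, and every rating value are made-up assumptions for illustration, and the standard textbook formulas are used (sample variances for α, average ranks for ties in the rank correlation).

```python
from statistics import mean, variance


def cronbach_alpha(scores):
    """Cronbach's alpha for rows of item scores (one row per rated case)."""
    k = len(scores[0])  # number of items in the instrument
    # Sample variance of each item across all rated cases.
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    # Sample variance of the per-case total score.
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def average_ranks(values):
    """Rank values from 1..n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Rank correlation: Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Hypothetical data: 5 cases rated on 3 items (NOT values from the study).
ratings = [
    [2, 3, 3],
    [1, 1, 2],
    [3, 3, 3],
    [0, 1, 1],
    [2, 2, 3],
]
first_totals = [sum(row) for row in ratings]  # first rating session
second_totals = [7, 4, 9, 3, 8]               # hypothetical second session

alpha = cronbach_alpha(ratings)
rho = spearman_rho(first_totals, second_totals)
print(f"alpha = {alpha:.3f}, r_s = {rho:.3f}")
```

With real data, a statistics package would normally be used instead, since it also reports the p-values quoted in the summary, which this sketch does not compute.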