Reliability of a generative artificial intelligence tool for pediatric familial Mediterranean fever: insights from a multicentre expert survey

Bibliographic Details
Published in	Pediatric rheumatology online journal Vol. 22; no. 1; pp. 78 - 11
Main Authors	La Bella, Saverio, Attanasi, Marina, Porreca, Annamaria, Di Ludovico, Armando, Maggio, Maria Cristina, Gallizzi, Romina, La Torre, Francesco, Rigante, Donato, Soscia, Francesca, Ardenti Morini, Francesca, Insalaco, Antonella, Natale, Marco Francesco, Chiarelli, Francesco, Simonini, Gabriele, De Benedetti, Fabrizio, Gattorno, Marco, Breda, Luciana
Format	Journal Article
Language	English
Published	England BioMed Central Ltd 23.08.2024 BMC
Subjects	Artificial Intelligence; Child; Computer software industry; Disodium pamidronate; Familial Mediterranean fever; Familial Mediterranean Fever - diagnosis; FMF; Generative artificial intelligence; Humans; Medical care; Observer Variation; Pediatric rheumatology; Pediatrics; Quality management; Reproducibility of Results; Surveys; Surveys and Questionnaires; United States; AI
Summary:	Artificial intelligence (AI) has become a popular tool for clinical and research use in the medical field. The aim of this study was to evaluate the accuracy and reliability of a generative AI tool on pediatric familial Mediterranean fever (FMF). Fifteen questions on pediatric FMF, each repeated three times, were submitted to the popular generative AI tool Microsoft Copilot with Chat-GPT 4.0. Nine pediatric rheumatology experts, blinded to one another's ratings, rated response accuracy on a Likert-like scale from 1 to 5. Median values for overall responses ranged from 2.00 to 5.00 at the first assessment, from 2.00 to 4.00 at the second, and from 3.00 to 4.00 at the third. Intra-rater agreement was poor to moderate (intraclass correlation coefficient range: -0.151 to 0.534). Agreement among experts diminished across the repeated responses, with Krippendorff's alpha falling from 0.136 (first response) to 0.132 (second response) to 0.089 (third response). Lastly, experts displayed varying levels of trust in AI before and after the survey. AI has promising implications in pediatric rheumatology, including early diagnosis and management optimization, but challenges persist due to uncertain information reliability and the lack of expert validation. Our survey revealed considerable inaccuracies and incompleteness in AI-generated responses regarding FMF, with poor intra- and inter-rater reliability. Human validation remains crucial in managing AI-generated medical information.
ISSN:	1546-0096
DOI:	10.1186/s12969-024-01011-0