Evaluating the diagnostic performance of a large language model‐powered chatbot for providing immunohistochemistry recommendations in dermatopathology

Bibliographic Details
Published in	Journal of Cutaneous Pathology, Vol. 51, no. 9, pp. 689-695
Main Authors	McCrary, Myles R.; Galambus, Justine; Chen, Wei‐Shen
Format	Journal Article
Language	English
Published	Oxford, UK: Blackwell Publishing Ltd; Wiley Subscription Services, Inc; 01.09.2024
Subjects	artificial intelligence; chatbots; immunohistochemistry; large language models; performance evaluation
Summary:	Background: Large language model (LLM)‐powered chatbots such as ChatGPT have numerous applications. However, their effectiveness in dermatopathology has not been formally evaluated. Dermatopathological cases often require immunohistochemical workup. Here, we evaluate the performance of a chatbot in providing diagnostically useful information on immunohistochemistry relating to dermatological diseases. Methods: We queried a commonly used chatbot for the immunophenotypes of 51 cutaneous diseases, including a diverse variety of epidermal, adnexal, hematolymphoid, and soft tissue entities. We requested it to provide references for each diagnosis. All tests were repeated, compiled, quantified, and then compared with established literature standards. Results: Clustering analysis demonstrated that recommendations correlated with tumor type, suggesting chatbots can supply appropriate panels. However, a significant portion of recommendations were factually incorrect (13.9%). Citations were rarely clinically useful (24.5%). Many were confabulated (27.2%). Prompt responses for cutaneous adnexal lesions tended to be less accurate while literature references were less useful. Reference retrieval performance was associated with the number of PubMed entries per entity. Conclusions: This foundational study suggests that LLM‐powered chatbots may be useful for generating immunohistochemical panels for dermatologic diagnoses. However, specific performance capabilities and biases must be considered. In addition, extreme caution is advised regarding the tendencies to fabricate material. Future models intentionally fine‐tuned to augment diagnostic medicine may prove to be valuable.
Bibliography:	Myles R. McCrary and Justine Galambus contributed equally to this study.
ISSN:	0303-6987, 1600-0560
DOI:	10.1111/cup.14631