Evaluating the Quality of Psychotherapy Conversational Agents: Framework Development and Cross-Sectional Study

Bibliographic Details
Published in	JMIR formative research Vol. 9; p. e65605
Main Authors	Sobowale, Kunmi, Humphrey, Daniel Kevin
Format	Journal Article
Language	English
Published	Canada JMIR Publications 02.07.2025
Subjects	AI-Powered Therapy Bots and Virtual Companions in Digital Mental Health; Artificial Intelligence; Chatbots; Chatbots and Conversational Agents; Communication; Cross-Sectional Studies; Development and Evaluation of Research Methods, Instruments and Tools; Digital Mental Health Interventions, e-Mental Health and Cyberpsychology; Eating behavior; Generative artificial intelligence; Generative Language Models Including ChatGPT; Humans; Large language models; Mental disorders; Mental health care; Original Paper; Popularity; Privacy; Psychotherapy; Psychotherapy - methods; Psychotherapy - standards; Research Instruments, Questionnaires, and Tools; Telemedicine; Therapists; Therapy; psychotherapy chatbots treatment evaluation study generative AI accessibility researchers risk evaluation therapy AI psychotherapy artificial intelligence conversational agent digital health therapeutic alliance ChatGPT chatbots clinicians large language models evaluation framework
Summary:	Despite potential risks, artificial intelligence-based chatbots that simulate psychotherapy are becoming more widely available and are frequently used by the general public. A comprehensive way of evaluating the quality of these chatbots is needed. To address this need, we developed the CAPE (Conversational Agent for Psychotherapy Evaluation) framework to aid clinicians, researchers, and lay users in assessing psychotherapy chatbot quality, and we used it to evaluate and compare the quality of popular artificial intelligence psychotherapy chatbots on OpenAI's GPT store. We identified 4 popular chatbots on the store. Two reviewers independently applied the CAPE framework to these chatbots, using 2 fictional personas to simulate interactions. The modular framework has 8 sections, each yielding an independent quality subscore between 0 and 1. We used t tests and nonparametric Wilcoxon signed rank tests to examine pairwise differences in quality subscores between chatbots. Chatbots consistently scored highly on the sections of background information (subscores=0.83-1), conversational capabilities (subscores=0.83-1), therapeutic alliance and boundaries (subscores=0.75-1), and accessibility (subscores=0.8-0.95). Scores were low for the therapeutic orientation (subscores=0) and monitoring and risk evaluation (subscores=0.67-0.75) sections. For the training data and knowledge base section, information was not transparent (subscores=0). Except for the privacy and harm section (mean 0.017, SD 0.00; t3=∞; P<.001), there were no differences in subscores between the chatbots. The CAPE framework offers a robust and reliable method for assessing the quality of psychotherapy chatbots, enabling users to make informed choices based on their specific needs and preferences. Our evaluation revealed that while the popular chatbots on OpenAI's GPT store were effective at developing rapport and were easily accessible, they failed to adequately address essential safety and privacy functions.
ISSN:	2561-326X
DOI:	10.2196/65605