Enhancing systematic literature reviews with generative artificial intelligence: development, applications, and performance evaluation

Bibliographic Details
Published in: Journal of the American Medical Informatics Association (JAMIA), Vol. 32, No. 4, pp. 616-625
Main Authors: Li, Ying; Datta, Surabhi; Rastegar-Mojarad, Majid; Lee, Kyeryoung; Paek, Hunki; Glasgow, Julie; Liston, Chris; He, Long; Wang, Xiaoyan; Xu, Yingxin
Format: Journal Article
Language: English
Published: Oxford University Press (England), 01.04.2025
ISSN: 1067-5027, 1527-974X
DOI: 10.1093/jamia/ocaf030

Summary: We developed and validated a large language model (LLM)-assisted system for conducting systematic literature reviews (SLRs) in health technology assessment (HTA) submissions. The system comprises five modules and operates on abstracts acquired from PubMed: (1) literature search query setup; (2) study protocol setup using population, intervention/comparison, outcome, and study type (PICOs) criteria; (3) LLM-assisted abstract screening; (4) LLM-assisted data extraction; and (5) data summarization. The system incorporates a human-in-the-loop design that allows real-time adjustment of the PICOs criteria. This is achieved by collecting information on disagreements between the LLM and human reviewers regarding inclusion/exclusion decisions and their rationales, enabling informed PICOs refinement.

We generated four evaluation sets, including relapsed and refractory multiple myeloma (RRMM) and advanced melanoma, to evaluate the LLM's performance in three key areas: (1) recommending inclusion/exclusion decisions during abstract screening, (2) providing valid rationales for abstract exclusion, and (3) extracting relevant information from included abstracts. The system demonstrated relatively high performance across all evaluation sets. For abstract screening, it achieved an average sensitivity of 90%, F1 score of 82, accuracy of 89%, and Cohen's κ of 0.71, indicating substantial agreement between human reviewers and LLM-based results. In identifying specific exclusion rationales, the system attained accuracies of 97% and 84%, and F1 scores of 98 and 89, for RRMM and advanced melanoma, respectively. For data extraction, the system achieved an F1 score of 93. Results showed high sensitivity, Cohen's κ, and prevalence-adjusted bias-adjusted kappa (PABAK) for abstract screening, and high F1 scores for data extraction.

This human-in-the-loop, AI-assisted SLR system demonstrates the potential of GPT-4's in-context learning capabilities by eliminating the need for manually annotated training data. In addition, the LLM-based system offers subject matter experts greater control through prompt adjustment and real-time feedback, enabling iterative refinement of the PICOs criteria based on performance metrics. The system shows potential to streamline SLRs by reducing time, cost, and human error while enhancing evidence generation for HTA submissions.
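As a minimal illustration (not the authors' code or data), the screening-agreement metrics named in the summary can be computed from paired human vs. LLM include/exclude decisions; the decision vectors below are hypothetical, and PABAK is derived directly from observed agreement as 2·p_o − 1.

    # Minimal sketch, assuming hypothetical decision vectors:
    # computing the abstract-screening agreement metrics named above.
    # 1 = include, 0 = exclude.
    from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score, recall_score

    human = [1, 0, 1, 1, 0, 0, 1, 0]  # human reviewer decisions (hypothetical)
    llm   = [1, 0, 1, 0, 0, 0, 1, 1]  # LLM-recommended decisions (hypothetical)

    sensitivity = recall_score(human, llm)       # share of human "include" calls the LLM also included
    accuracy    = accuracy_score(human, llm)     # overall observed agreement, p_o
    f1          = f1_score(human, llm)           # harmonic mean of precision and sensitivity
    kappa       = cohen_kappa_score(human, llm)  # Cohen's kappa: chance-corrected agreement
    pabak       = 2 * accuracy - 1               # prevalence-adjusted bias-adjusted kappa (PABAK)

    print(f"sensitivity={sensitivity:.2f}  accuracy={accuracy:.2f}  "
          f"F1={f1:.2f}  kappa={kappa:.2f}  PABAK={pabak:.2f}")

On these hypothetical vectors the script prints sensitivity=0.75, accuracy=0.75, F1=0.75, kappa=0.50, PABAK=0.50; the figures reported in the summary come from the study's four evaluation sets, not from this example.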