The emergence of large language models as tools in literature reviews: a large language model-assisted systematic review


Bibliographic Details
Published in: Journal of the American Medical Informatics Association (JAMIA), Vol. 32, no. 6, pp. 1071-1086
Main Authors: Scherbakov, Dmitry; Hubig, Nina; Jansari, Vinita; Bakumenko, Alexander; Lenert, Leslie A
Format: Journal Article
Language: English
Published: England: Oxford University Press, 01.06.2025
Summary: This study aims to summarize the usage of large language models (LLMs) in the process of creating a scientific review by examining methodological papers that describe the use of LLMs in review automation and review papers that acknowledge being produced with the support of LLMs. The search was conducted in June 2024 in PubMed, Scopus, Dimensions, and Google Scholar by human reviewers. The screening and extraction process took place in Covidence with the help of an LLM add-on based on the OpenAI GPT-4o model. ChatGPT and Scite.ai were used to clean the data, generate the code for figures, and draft the manuscript. Of the 3788 articles retrieved, 172 studies were deemed eligible for the final review. ChatGPT and GPT-based LLMs emerged as the dominant architecture for review automation (n = 126, 73.2%). A significant number of review automation projects were found, but only a limited number of papers (n = 26, 15.1%) were actual reviews that acknowledged LLM usage. Most citations focused on the automation of a particular stage of the review, such as searching for publications (n = 60, 34.9%) and data extraction (n = 54, 31.4%). When comparing the pooled performance of GPT-based and BERT-based models, the former performed better at data extraction, with a mean precision of 83.0% (SD = 10.4) and a recall of 86.0% (SD = 9.8). Our LLM-assisted systematic review revealed a significant number of research projects related to review automation using LLMs. Despite limitations, such as lower accuracy of extraction for numeric data, we anticipate that LLMs will soon change the way scientific reviews are conducted.
D. Scherbakov and N. Hubig contributed equally to this work.
ISSN: 1067-5027
EISSN: 1527-974X
DOI: 10.1093/jamia/ocaf063