Large language models for conducting systematic reviews: on the rise, but not yet ready for use—a scoping review

Bibliographic Details
Published in: Journal of Clinical Epidemiology, Vol. 181; p. 111746
Main Authors: Lieberum, Judith-Lisa; Toews, Markus; Metzendorf, Maria-Inti; Heilmeyer, Felix; Siemens, Waldemar; Haverkamp, Christian; Böhringer, Daniel; Meerpohl, Joerg J.; Eisele-Metzger, Angelika
Format: Journal Article
Language: English
Published: United States, Elsevier Inc, 01.05.2025
Elsevier Limited

Summary: Machine learning promises versatile help in the creation of systematic reviews (SRs). Recently, further developments in the form of large language models (LLMs) and their application in SR conduct have attracted attention. We aimed to provide an overview of LLM applications in SR conduct in health research. We systematically searched MEDLINE, Web of Science, IEEE Xplore, ACM Digital Library, Europe PMC (preprints), and Google Scholar, and conducted an additional hand search (last search: February 26, 2024). We included scientific articles in English or German published from April 2021 onwards, building upon the results of a mapping review that had not yet identified LLM applications to support SRs. Two reviewers independently screened studies for eligibility; after piloting, 1 reviewer extracted data, which were checked by another. Our database search yielded 8054 hits, and we identified 33 articles from our hand search. We ultimately included 37 articles on LLM support. LLM approaches covered 10 of 13 defined SR steps, most frequently literature search (n = 15, 41%), study selection (n = 14, 38%), and data extraction (n = 11, 30%). The most frequently used LLM was Generative Pretrained Transformer (GPT) (n = 33, 89%). Validation studies were predominant (n = 21, 57%). In about half of the studies, authors evaluated LLM use as promising (n = 20, 54%), in about one-quarter as neutral (n = 9, 24%), and in about one-fifth as nonpromising (n = 8, 22%). Although LLMs show promise in supporting SR creation, fully established or validated applications are often lacking. The rapid increase in research on LLMs for evidence synthesis production highlights their growing relevance.

Systematic reviews are a crucial tool in health research in which experts carefully collect and analyze all available evidence on a specific research question. Creating these reviews is typically time- and resource-intensive, often taking months or even years to complete, as researchers must thoroughly search, evaluate, and synthesize an immense number of scientific studies. For the present article, we conducted a review to understand how new artificial intelligence (AI) tools, specifically large language models (LLMs) like Generative Pretrained Transformer (GPT), can be used to help create systematic reviews in health research. We searched multiple scientific databases and ultimately found 37 relevant articles. We found that LLMs have been tested to help with various parts of the systematic review process, particularly in 3 main areas: searching scientific literature (41% of studies), selecting relevant studies (38%), and extracting important information from these studies (30%). GPT was the most commonly used LLM, appearing in 89% of the studies. Most of the research (57%) focused on testing whether these AI tools actually work as intended in the context of systematic review production. The results were mixed: about half of the studies found LLMs promising, a quarter were neutral, and one-fifth found them not promising. While LLMs show potential for making the systematic review process more efficient, there is still a lack of fully tested and validated applications. However, the growing number of studies in this field suggests that these AI tools are becoming increasingly important in creating systematic reviews.

• GPT was the most commonly used large language model (LLM).
• LLM applications covered 10 of 13 defined SR steps, most often literature search.
• Validation studies predominated, but fully established LLM applications are rare.
• Our results highlight the increasing relevance of LLM use in the field.
ISSN: 0895-4356, 1878-5921
DOI: 10.1016/j.jclinepi.2025.111746