The emergence of Large Language Models (LLM) as a tool in literature reviews: an LLM automated systematic review
Main Authors | , , , , |
---|---|
Format | Journal Article |
Language | English |
Published | 06.09.2024 |
Summary: | Journal of the American Medical Informatics Association. 2025 May
7:ocaf063. Objective: This study aims to summarize the usage of Large Language Models
(LLMs) in the process of creating a scientific review. We look at the range of
stages in a review that can be automated and assess the current
state-of-the-art research projects in the field. Materials and Methods: The
search was conducted in June 2024 in the PubMed, Scopus, Dimensions, and Google
Scholar databases by human reviewers. The screening and extraction process took
place in Covidence with the help of an LLM add-on that uses the OpenAI gpt-4o model.
ChatGPT was used to clean extracted data and generate code for figures in this
manuscript; ChatGPT and Scite.ai were used in drafting all components of the
manuscript except the methods and discussion sections. Results: 3,788 articles
were retrieved, and 172 studies were deemed eligible for the final review.
ChatGPT and GPT-based LLMs emerged as the dominant architecture for review
automation (n=126, 73.2%). A significant number of review automation projects
were found, but only a limited number of papers (n=26, 15.1%) were actual
reviews that used an LLM during their creation. Most citations focused on
automating a particular stage of the review, such as searching for publications
(n=60, 34.9%) and data extraction (n=54, 31.4%). When comparing the pooled
performance of GPT-based and BERT-based models, the former were better in data
extraction, with mean precision of 83.0% (SD=10.4) and recall of 86.0% (SD=9.8),
while being slightly less accurate in the title and abstract screening stage
(mean accuracy 77.3%, SD=13.0). Discussion/Conclusion: Our LLM-assisted systematic
review revealed a significant number of research projects related to review
automation using LLMs. The results look promising, and we anticipate that
LLMs will soon change the way scientific reviews are conducted. |
---|---|
DOI: | 10.48550/arxiv.2409.04600 |