Can large language models fully automate or partially assist paper selection in systematic reviews?



Bibliographic Details
Published in	British journal of ophthalmology Vol. 109; no. 8; pp. 962-966
Main Authors	Chen, Haichao; Jiang, Zehua; Liu, Xinyu; Xue, Can Can; Yew, Samantha Min Er; Sheng, Bin; Zheng, Ying-Feng; Wang, Xiaofei; Wu, You; Sivaprasad, Sobha; Wong, Tien Yin; Chaudhary, Varun; Tham, Yih Chung
Format	Journal Article
Language	English
Published	BMA House, Tavistock Square, London, WC1H 9JR; BMJ Publishing Group Ltd; 01.08.2025
Subjects	Automation; Childrens health; Diabetes; Diabetic retinopathy; Epidemiology; Female; Humans; Language; Large Language Models; Natural language processing; Ophthalmology; Pregnancy; Public health; Research methodology; Review Literature as Topic; Systematic review; Systematic Reviews as Topic
Summary:	Background/aims: Large language models (LLMs) have substantial potential to enhance the efficiency of academic research. The accuracy and performance of LLMs in systematic reviews, a core part of evidence building, have yet to be studied in detail. Methods: We introduced two LLM-based approaches to systematic review: an LLM-enabled fully automated approach (LLM-FA) utilising three different GPT-4 plugins (Consensus GPT, Scholar GPT and GPT web browsing modes) and an LLM-facilitated semi-automated approach (LLM-SA) using GPT-4’s Application Programming Interface (API). We benchmarked these approaches using three published systematic reviews that reported the prevalence of diabetic retinopathy across different populations (general population, pregnant women and children). Results: The three published reviews comprised 98 papers in total. Across these three reviews, in the LLM-FA approach, Consensus GPT correctly identified 32.7% (32 out of 98) of papers, while Scholar GPT and GPT-4’s web browsing modes identified only 19.4% (19 out of 98) and 6.1% (6 out of 98), respectively. By contrast, the LLM-SA approach not only successfully included 82.7% (81 out of 98) of these papers but also correctly excluded 92.2% of 4497 irrelevant papers. Conclusions: Our findings suggest LLMs are not yet capable of autonomously identifying and selecting relevant papers in systematic reviews. However, they hold promise as an assistive tool to improve the efficiency of the paper selection process in systematic reviews.
Bibliography:	Clinical science
ISSN:	0007-1161 1468-2079
DOI:	10.1136/bjo-2024-326254
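
The percentages quoted in the summary can be re-derived from the reported counts. The sketch below is only an arithmetic check, not the authors' analysis code: the variable and function names are illustrative, and because the summary gives the exclusion rate (92.2% of 4497) without the underlying count, the number of correctly excluded papers is an approximation inferred from that rate.

```python
# Counts as reported in the summary; all names here are illustrative.
TOTAL_INCLUDED = 98      # papers across the three published reviews
TOTAL_IRRELEVANT = 4497  # irrelevant papers screened in the LLM-SA approach

# Papers from the published reviews that each approach identified or included.
identified = {
    "Consensus GPT (LLM-FA)": 32,
    "Scholar GPT (LLM-FA)": 19,
    "GPT-4 web browsing (LLM-FA)": 6,
    "GPT-4 API (LLM-SA)": 81,
}


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Share of the 98 papers recovered by each approach.
recall = {name: percent(n, TOTAL_INCLUDED) for name, n in identified.items()}

for name, value in recall.items():
    print(f"{name}: {value}% of {TOTAL_INCLUDED} papers")

# ASSUMPTION: the exact exclusion count is not given in the summary, so it is
# approximated here from the reported 92.2% rate.
approx_excluded = round(0.922 * TOTAL_IRRELEVANT)
print(f"Approx. correctly excluded: {approx_excluded} of {TOTAL_IRRELEVANT}")
```

Running the check reproduces the one-decimal percentages stated in the summary for all four approaches.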