In‐depth evaluation of machine learning methods for semi‐automating article screening in a systematic review of mechanistic literature
Published in | Research synthesis methods, Vol. 14, no. 2, pp. 156-172 |
---|---|
Main Authors | , , |
Format | Journal Article |
Language | English |
Published | England: Wiley, 01.03.2023 (Wiley Subscription Services, Inc) |
Summary: | We aimed to evaluate the performance of supervised machine learning algorithms in predicting articles relevant for full‐text review in a systematic review. Overall, 16,430 manually screened titles/abstracts, including 861 references identified as relevant for full‐text review, were used for the analysis. Of these, 40% (n = 6573) were subdivided for training (70%) and testing (30%) the algorithms. The remaining 60% (n = 9857) were used as a validation set. We evaluated down‐sampling and up‐sampling methods and compared unigram, bigram, and singular value decomposition (SVD) approaches. For each approach, Naïve Bayes, support vector machines (SVM), regularized logistic regression, neural networks, random forest, LogitBoost, and XGBoost were implemented using simple term-frequency or Tf‐Idf feature representations. Performance was evaluated using sensitivity, specificity, precision, and area under the curve (AUC). We combined the predictions of the best‐performing algorithms (Youden index ≥ 0.3 with sensitivity/specificity ≥ 70%/60%). In the down‐sampled unigram approach, Naïve Bayes, the SVM/quanteda text models with Tf‐Idf, and the linear SVM from the e1071 package with Tf‐Idf achieved >90% sensitivity at >65% specificity. Combining the predictions of the 10 best‐performing algorithms improved performance, reaching 95% sensitivity and 64% specificity in the validation set. The crude screening burden was reduced by 61% (n = 5979; adjusted: 80.3%) with a 5% (n = 27) false‐negative rate. All other approaches yielded relatively poorer performance. The down‐sampling unigram approach achieved good performance in our data. Combining the predictions of the algorithms improved sensitivity while the screening burden was reduced by almost two‐thirds. Implementing machine learning approaches in title/abstract screening should be investigated further toward refining these tools and automating their implementation. |
---|---|
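The workflow the abstract describes (down-sampling the majority class, unigram Tf-Idf features, several classifier families, and combining the predictions of the retained models) can be sketched in a few lines. This is a minimal illustration, not the authors' R-based implementation: the toy corpus, the choice of scikit-learn estimators, and the union rule for combining predictions are all assumptions made for demonstration; the Youden index (sensitivity + specificity − 1) matches the ≥0.3 selection threshold mentioned in the abstract.

```python
import random

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

random.seed(0)

# Toy corpus standing in for manually screened titles/abstracts
# (labels: 1 = relevant for full-text review, 0 = irrelevant).
relevant = [f"mechanistic pathway of tumour biology study {i}" for i in range(20)]
irrelevant = [f"unrelated economics policy report {i}" for i in range(80)]
texts = relevant + irrelevant
labels = [1] * len(relevant) + [0] * len(irrelevant)

# Down-sample the majority (irrelevant) class to balance the training set.
neg_idx = random.sample(range(len(relevant), len(texts)), len(relevant))
train_idx = list(range(len(relevant))) + neg_idx
X_train_txt = [texts[i] for i in train_idx]
y_train = [labels[i] for i in train_idx]

# Unigram Tf-Idf feature representation.
vec = TfidfVectorizer(ngram_range=(1, 1))
X_train = vec.fit_transform(X_train_txt)
X_all = vec.transform(texts)

# Two of the classifier families named in the abstract (stand-ins only).
models = [MultinomialNB(), LinearSVC()]
preds = []
for m in models:
    m.fit(X_train, y_train)
    preds.append(m.predict(X_all))

# Combine predictions: flag a record if any model calls it relevant
# (a union rule keeps sensitivity high at some cost in specificity).
combined = [int(any(p[i] for p in preds)) for i in range(len(texts))]

def sens_spec(y_true, y_pred):
    """Sensitivity and specificity from binary labels and predictions."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), tn / (tn + fp)

sens, spec = sens_spec(labels, combined)
youden = sens + spec - 1  # models were retained when Youden index >= 0.3
print(f"sensitivity={sens:.2f} specificity={spec:.2f} youden={youden:.2f}")
```

In the study itself, records flagged as irrelevant by the combined models would be excluded from manual screening, which is where the reported reduction in screening burden comes from.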
Bibliography: | Funding information: World Cancer Research Fund; Wereld Kanker Onderzoek Fonds |
ISSN: | 1759-2879; 1759-2887 |
DOI: | 10.1002/jrsm.1589 |