Topic detection using paragraph vectors to support active learning in systematic reviews

Bibliographic Details
Published in Journal of biomedical informatics Vol. 62; pp. 59-65
Main Authors Hashimoto, Kazuma, Kontonatsios, Georgios, Miwa, Makoto, Ananiadou, Sophia
Format Journal Article
Language English
Published United States Elsevier Inc 01.08.2016
Elsevier
Summary:
• We propose a topic detection method based on paragraph vectors.
• The method is integrated with an active learner to accelerate citation screening.
• The method outperforms LDA when applied to clinical and public health reviews.

Systematic reviews require expert reviewers to manually screen thousands of citations in order to identify all articles relevant to the review. Active learning text classification is a supervised machine learning approach that has been shown to significantly reduce the manual annotation workload by semi-automating the citation screening process of systematic reviews. In this paper, we present a new topic detection method that induces an informative representation of studies, to improve the performance of the underlying active learner. Our proposed topic detection method uses a neural network-based vector space model to capture semantic similarities between documents. We first represent documents within the vector space and cluster them into a predefined number of clusters. The centroids of the clusters are treated as latent topics. We then represent each document as a mixture of latent topics. For evaluation purposes, we employ the active learning strategy using both our novel topic detection method and a baseline topic model (i.e., Latent Dirichlet Allocation). The results demonstrate that our method achieves high sensitivity in identifying eligible studies at a significantly lower manual annotation cost than the baseline method. This observation is consistent across two clinical and three public health reviews. The tool introduced in this work is available from https://nactem.ac.uk/pvtopic/.
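The pipeline described in the summary (embed documents, cluster them, treat centroids as latent topics, express each document as a topic mixture) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes plain k-means as the clustering step and a softmax over negative squared distances as the mixture weighting, neither of which is specified in the summary. The two-dimensional `doc_vectors` are toy stand-ins for learned paragraph vectors, and the function names are hypothetical.

```python
import math
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def kmeans(vectors, k, iterations=50, seed=0):
    """Plain Lloyd's k-means (an assumed choice); returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda j: squared_distance(v, centroids[j]))
            clusters[nearest].append(v)
        # Keep the old centroid if a cluster ends up empty.
        updated = [mean_vector(c) if c else centroids[j] for j, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def topic_mixture(vector, centroids):
    """Soft assignment of one document to each centroid ("latent topic").

    Assumption: softmax over negative squared distances; the summary does
    not state how the mixture weights are computed.
    """
    scores = [-squared_distance(vector, c) for c in centroids]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


# Toy document vectors standing in for learned paragraph vectors.
doc_vectors = [
    [0.9, 0.1], [1.0, 0.0], [0.8, 0.2],
    [0.1, 0.9], [0.0, 1.0], [0.2, 0.8],
]
centroids = kmeans(doc_vectors, k=2)
features = [topic_mixture(v, centroids) for v in doc_vectors]
```

Each row of `features` is a vector of topic weights summing to one. In a screening setting, rows like these would serve as the document representation passed to the active learner's classifier.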
These authors contributed equally to this work.
ISSN: 1532-0464, 1532-0480
DOI: 10.1016/j.jbi.2016.06.001