Recurrent neural networks with specialized word embeddings for health-domain named-entity recognition

Bibliographic Details
Published in: Journal of biomedical informatics, Vol. 76, pp. 102-109
Main Authors: Jauregi Unanue, Iñigo; Zare Borzeshi, Ehsan; Piccardi, Massimo
Format: Journal Article
Language: English
Published: United States: Elsevier Inc, 01.12.2017

More Information
Summary:
• Past approaches to health-domain NER have mainly used manual features and conventional classifiers.
• In this paper, we explore a neural network approach (B-LSTM-CRF) that can learn the features automatically.
• In addition, initializing the features with pre-trained embeddings can lead to higher accuracy.
• We pre-train the features using a critical care database (MIMIC-III).
• Experiments have been carried out over three contemporary datasets for health-domain NER, outperforming past systems.

Previous state-of-the-art systems for Drug Name Recognition (DNR) and Clinical Concept Extraction (CCE) have relied on a combination of text "feature engineering" and conventional machine learning algorithms such as conditional random fields and support vector machines. However, developing good features is inherently time-consuming. Conversely, more modern machine learning approaches such as recurrent neural networks (RNNs) have proved capable of automatically learning effective features, starting from either random initializations or automated word "embeddings". Our objectives are: (i) to create a highly accurate DNR and CCE system that avoids conventional, time-consuming feature engineering; (ii) to create richer, more specialized word embeddings by using health-domain datasets such as MIMIC-III; and (iii) to evaluate our systems over three contemporary datasets. Two deep learning methods, namely the Bidirectional LSTM and the Bidirectional LSTM-CRF, are evaluated. A CRF model is used as the baseline to compare the deep learning systems against a traditional machine learning approach. The same features are used for all the models. We have obtained the best results with the Bidirectional LSTM-CRF model, which has outperformed all previously proposed systems. The specialized embeddings have helped to cover unusual words in DrugBank and MedLine, but not in the i2b2/VA dataset. We present a state-of-the-art system for DNR and CCE. Automated word embeddings have allowed us to avoid costly feature engineering and to achieve higher accuracy. Nevertheless, the embeddings need to be trained on datasets representative of the domain in order to adequately cover the domain-specific vocabulary.
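For illustration only, the sketch below shows a minimal bidirectional LSTM tagger in PyTorch of the kind the abstract describes. The class name, vocabulary size, dimensions and tag count are hypothetical; the CRF layer that distinguishes the full B-LSTM-CRF and the MIMIC-III pre-training of the embeddings are indicated only in comments, not implemented here.

import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    # Minimal sketch of a bidirectional LSTM sequence tagger for NER.
    # The paper's B-LSTM-CRF adds a CRF layer on top of the per-token
    # emission scores to model tag-transition constraints.
    def __init__(self, vocab_size, num_tags, embed_dim=100, hidden_dim=100,
                 pretrained_vectors=None):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        if pretrained_vectors is not None:
            # e.g. word embeddings pre-trained on MIMIC-III clinical text
            self.embed.weight.data.copy_(pretrained_vectors)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer word indices
        x = self.embed(token_ids)   # (batch, seq_len, embed_dim)
        h, _ = self.lstm(x)         # (batch, seq_len, 2 * hidden_dim)
        return self.out(h)          # per-token tag scores (emissions)

# Usage sketch: the emissions feed a softmax (B-LSTM) or a CRF (B-LSTM-CRF).
model = BiLSTMTagger(vocab_size=20000, num_tags=9)   # e.g. BIO drug-name tags
scores = model(torch.randint(0, 20000, (2, 30)))     # -> shape (2, 30, 9)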
ISSN: 1532-0464, 1532-0480
DOI: 10.1016/j.jbi.2017.11.007