The quest for better clinical word vectors: Ontology based and lexical vector augmentation versus clinical contextual embeddings


Bibliographic Details
Published in Computers in Biology and Medicine Vol. 134; p. 104433
Main Authors Nath, Namrata; Lee, Sang-Heon; McDonnell, Mark D.; Lee, Ivan
Format Journal Article
Language English
Published United States Elsevier Ltd 01.07.2021
Elsevier Limited
Summary: Word vectors, or word embeddings, are n-dimensional representations of words and form the backbone of natural language processing (NLP) of textual data. This research experiments with algorithms that augment word vectors with lexical constraints popular in NLP research and with clinical domain constraints derived from the Unified Medical Language System (UMLS). It also compares the performance of the augmented vectors with Bio + Clinical BERT vectors, which have been trained and fine-tuned on clinical datasets. Word2vec vectors are generated for words in a publicly available de-identified Electronic Health Records (EHR) dataset and augmented with ontologies using three algorithms that take fundamentally different approaches to vector augmentation. The augmented vectors are then evaluated alongside publicly available Bio + Clinical BERT vectors on their correlation with human-annotated lists, measured with Spearman's correlation coefficient. They are also evaluated on the downstream task of Named Entity Recognition (NER). Quantitative and empirical evaluations are used to highlight the strengths and weaknesses of the different approaches. In our experiments, the counter-fitted word2vec vectors augmented with information from the UMLS ontology produce the best overall correlation with human-annotated evaluation lists (Spearman's correlation of 0.733 with the mini mayo doctors' annotation), while Bio + Clinical BERT produces the best results on the NER task (F1 of 0.87 and 0.811 on the i2b2 2010 and i2b2 2012 datasets, respectively). Clinically adapted word2vec vectors successfully encapsulate concepts of lexical and clinical synonymy and antonymy and, to a smaller extent, hyponymy and hypernymy. Bio + Clinical BERT vectors perform better at NER and avoid out-of-vocabulary words.
• Evaluated different styles of word2vec vector generation in the clinical context.
• Evaluated various linguistic and domain adaptation algorithms and constraints.
• Evaluated publicly available vector space models.
• Spearman's correlation of 0.73 with mini mayo doctors' list for adapted vectors.
• Bio + Clinical BERT gives best results on NER task using Bi-LSTM CRF.
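
The intrinsic evaluation described in the summary (correlating vector similarities with human-annotated lists using Spearman's correlation coefficient) can be illustrated with a minimal sketch. This is not the authors' code: the toy two-dimensional vectors, the word pairs, the ratings, and the function names (`cosine`, `average_ranks`, `spearman`, `evaluate`) are all hypothetical, and cosine similarity is assumed as the vector similarity measure. Pairs containing an out-of-vocabulary word are skipped here, which is one possible convention.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def average_ranks(values):
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def evaluate(vectors, rated_pairs):
    """Correlate model similarities with human ratings, skipping OOV pairs."""
    model_scores, human_scores = [], []
    for w1, w2, rating in rated_pairs:
        if w1 in vectors and w2 in vectors:
            model_scores.append(cosine(vectors[w1], vectors[w2]))
            human_scores.append(rating)
    return spearman(model_scores, human_scores)

# Hypothetical toy embeddings and ratings, for illustration only.
toy_vectors = {
    "heart": [1.0, 0.1],
    "cardiac": [0.9, 0.2],
    "kidney": [0.1, 1.0],
    "renal": [0.3, 0.8],
    "fever": [0.7, 0.7],
}
toy_pairs = [
    ("heart", "cardiac", 4.0),
    ("kidney", "renal", 3.5),
    ("heart", "fever", 2.0),
    ("heart", "kidney", 1.5),
    ("cardiac", "angina", 3.0),  # "angina" is out of vocabulary, so it is skipped
]

rho = evaluate(toy_vectors, toy_pairs)
print(round(rho, 3))
```

Because Spearman's coefficient depends only on rank order, a vector set scores well when it orders the word pairs the way the annotators did, regardless of the absolute similarity values.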
ISSN: 0010-4825, 1879-0534
DOI: 10.1016/j.compbiomed.2021.104433