Assigning clinical codes with data-driven concept representation on Dutch clinical free text

Bibliographic Details
Published in	Journal of biomedical informatics Vol. 69; pp. 118 - 127
Main Authors	Scheurwegs, Elyne; Luyckx, Kim; Luyten, Léon; Goethals, Bart; Daelemans, Walter
Format	Journal Article
Language	English
Published	United States: Elsevier Inc, 01.05.2017
Subjects	Algorithms; Clinical coding; Data mining; Distributional semantics; Electronic health records; Humans; International classification of diseases; Knowledge bases; Language; Natural language processing; Netherlands; Semantics; Text mining; Unsupervised learning; Word2vec
Summary:	• Clinical code assignment in a non-English setting. • Unsupervised medical concept extraction with an unlabelled corpus. • Distributional semantics to expand concept definitions.
Clinical codes are used for public reporting purposes, are fundamental to determining public financing for hospitals, and form the basis for reimbursement claims to insurance providers. They are assigned to a patient stay to reflect the diagnoses made and the procedures performed during that stay. This paper aims to enrich algorithms for automated clinical coding by taking a data-driven approach and by using unsupervised and semi-supervised techniques to extract multi-word expressions that convey a generalisable medical meaning (referred to as concepts). Several methods for extracting concepts from text are compared, two of which are constructed from a large unannotated corpus of clinical free text. A distributional semantic model (in this case, the word2vec skip-gram model) is used to generalize over concepts and retrieve relations between them. These methods are validated on three sets of patient stay data, in the disease areas of urology, cardiology, and gastroenterology. The datasets are in Dutch, which limits the concept definitions available from expert-based ontologies (e.g. UMLS). The results show that when expert-based knowledge in ontologies is unavailable, concepts derived from raw clinical texts are a reliable alternative. Both concepts derived from raw clinical texts and concepts derived from expert-created dictionaries outperform a bag-of-words approach in clinical code assignment. Adding features based on tokens that appear in a semantically similar context has a positive influence on predicting diagnostic codes. Furthermore, the experiments indicate that a distributional semantics model can find relations between semantically related concepts in texts but also introduces erroneous and redundant relations, which can undermine clinical coding performance.
ISSN:	1532-0464, 1532-0480
DOI:	10.1016/j.jbi.2017.04.007
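
The summary mentions adding features based on tokens that appear in a semantically similar context. The following is a minimal sketch of that general idea, not the authors' implementation: a bag-of-words count is extended with the nearest neighbours of each token in an embedding space. The toy vectors, the function names, the `sim:` feature prefix, and the `top_n` and `min_sim` parameters are all assumptions made for illustration; in practice the vectors would come from a skip-gram model trained on unlabelled clinical text.

```python
import math
from collections import Counter

# Hypothetical toy embeddings (Dutch tokens); real vectors would be learned
# by a skip-gram model from a large unannotated clinical corpus.
TOY_VECTORS = {
    "nier": [0.9, 0.1, 0.0],       # kidney
    "nieren": [0.85, 0.15, 0.05],  # kidneys
    "hart": [0.0, 0.9, 0.1],       # heart
    "cardiaal": [0.05, 0.85, 0.2], # cardiac
    "maag": [0.1, 0.0, 0.9],       # stomach
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def neighbours(token, vectors, top_n=1, min_sim=0.9):
    """Return up to top_n tokens whose similarity to `token` is >= min_sim."""
    if token not in vectors:
        return []
    scored = [
        (cosine(vectors[token], vec), other)
        for other, vec in vectors.items()
        if other != token
    ]
    scored = [(sim, other) for sim, other in scored if sim >= min_sim]
    scored.sort(reverse=True)
    return [other for _, other in scored[:top_n]]


def expanded_bag_of_words(tokens, vectors, top_n=1, min_sim=0.9):
    """Count the tokens, then add one 'sim:' feature per retrieved neighbour."""
    features = Counter(tokens)
    for token in tokens:
        for other in neighbours(token, vectors, top_n, min_sim):
            features["sim:" + other] += 1
    return features


if __name__ == "__main__":
    print(expanded_bag_of_words(["pijn", "nier"], TOY_VECTORS))
```

The `min_sim` threshold is one simple, assumed way to limit the erroneous and redundant relations that the summary says a distributional model can introduce; tokens without a vector (here `pijn`) are kept as plain counts.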