Utility of Features in a Natural-Language-Processing-Based Clinical De-Identification Model Using Radiology Reports for Advanced NSCLC Patients
The de-identification of clinical reports is essential to protect the confidentiality of patients. The natural-language-processing-based named entity recognition (NER) model is a widely used technique of automatic clinical de-identification. The performance of such a machine learning model relies la...
Saved in:
Published in | Applied sciences Vol. 12; no. 19; p. 9976 |
---|---|
Main Authors | , , , , , , , , , , , |
Format | Journal Article |
Language | English |
Published |
MDPI AG
01.10.2022
|
Subjects | |
Online Access | Get full text |
Cover
Loading…
Summary: | The de-identification of clinical reports is essential to protect the confidentiality of patients. The natural-language-processing-based named entity recognition (NER) model is a widely used technique of automatic clinical de-identification. The performance of such a machine learning model relies largely on the proper selection of features. The objective of this study was to investigate the utility of various features in a conditional-random-field (CRF)-based NER model. Natural language processing (NLP) toolkits were used to annotate the protected health information (PHI) from a total of 10,239 radiology reports that were divided into seven types. Multiple features were extracted by the toolkit and the NER models were built using these features and their combinations. A total of 10 features were extracted and the performance of the models was evaluated based on their precision, recall, and F1-score. The best-performing features were n-gram, prefix-suffix, word embedding, and word shape. These features outperformed others across all types of reports. The dataset we used was large in volume and divided into multiple types of reports. Such a diverse dataset made sure that the results were not subject to a small number of structured texts from where a machine learning model can easily learn the features. The manual de-identification of large-scale clinical reports is impractical. This study helps to identify the best-performing features for building an NER model for automatic de-identification from a wide array of features mentioned in the literature. |
---|---|
ISSN: | 2076-3417 2076-3417 |
DOI: | 10.3390/app12199976 |