Utility of Features in a Natural-Language-Processing-Based Clinical De-Identification Model Using Radiology Reports for Advanced NSCLC Patients

Bibliographic Details
Published in	Applied sciences Vol. 12; no. 19; p. 9976
Main Authors	Paul, Tanmoy; Islam, Humayera; Singh, Nitesh; Jampani, Yaswitha; Kotapati, Teja Venkat Pavan; Tautam, Preethi Aishwarya; Rana, Md Kamruz Zaman; Mandhadi, Vasanthi; Sharma, Vishakha; Barnes, Michael; Hammer, Richard D.; Mosa, Abu Saleh Mohammad
Format	Journal Article
Language	English
Published	MDPI AG 01.10.2022
Subjects	conditional random field (CRF); de-identification; named entity recognition (NER); natural language processing (NLP); protected health information

Summary:	De-identification of clinical reports is essential for protecting patient confidentiality. Natural-language-processing-based named entity recognition (NER) is a widely used technique for automatic clinical de-identification, and the performance of such a machine learning model depends largely on the proper selection of features. The objective of this study was to investigate the utility of various features in a conditional-random-field (CRF)-based NER model. Natural language processing (NLP) toolkits were used to annotate the protected health information (PHI) in a total of 10,239 radiology reports, which were divided into seven types. A total of 10 features were extracted by the toolkit, NER models were built using these features and their combinations, and model performance was evaluated in terms of precision, recall, and F1-score. The best-performing features were n-gram, prefix-suffix, word embedding, and word shape; these outperformed the other features across all report types. The dataset was large and spanned multiple report types. This diversity ensured that the results did not depend on a small number of structured texts from which a machine learning model can easily learn the features. Because manual de-identification of large-scale clinical reports is impractical, this study helps to identify, from the wide array of features mentioned in the literature, those that perform best for building an NER model for automatic de-identification.
ISSN:	2076-3417
DOI:	10.3390/app12199976
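
The summary names word shape, prefix-suffix, and n-gram among the best-performing features and reports precision, recall, and F1-score. The sketch below illustrates what such token-level features and metrics might look like. It is not the authors' toolkit or code: the function names, the affix length of 3, the boundary markers, and the sample tokens are all assumptions made for illustration, and the word-embedding feature is omitted.

```python
import re


def word_shape(token):
    """Collapse a token into a coarse shape: uppercase -> X, lowercase -> x, digit -> d."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    # Digits are mapped last so the inserted "d" is not rewritten as "x".
    shape = re.sub(r"[0-9]", "d", shape)
    # Collapse runs of the same character, e.g. "Xxxxx" -> "Xx".
    return re.sub(r"(.)\1+", r"\1", shape)


def token_features(tokens, i, affix_len=3):
    """Build a feature dictionary for tokens[i] (hypothetical feature names)."""
    tok = tokens[i]
    low = tok.lower()
    prev_tok = tokens[i - 1].lower() if i > 0 else "<BOS>"
    next_tok = tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>"
    return {
        "lower": low,
        "shape": word_shape(tok),
        "prefix": low[:affix_len],
        "suffix": low[-affix_len:],
        # Word bigrams with the neighbouring tokens.
        "bigram_prev": prev_tok + " " + low,
        "bigram_next": low + " " + next_tok,
    }


def precision_recall_f1(gold, pred, label):
    """Token-level precision, recall, and F1 for a single label."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up example sentence; not taken from the study's data.
tokens = ["Dr.", "Smith", "reviewed", "CT", "on", "01/02/2020"]
features = [token_features(tokens, i) for i in range(len(tokens))]
```

Per-token feature dictionaries of this kind are a common input format for CRF sequence labelers; which toolkit and feature definitions the study actually used is not stated in this record.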