Utility of Features in a Natural-Language-Processing-Based Clinical De-Identification Model Using Radiology Reports for Advanced NSCLC Patients

Bibliographic Details
Published in Applied Sciences Vol. 12; no. 19; p. 9976
Main Authors Paul, Tanmoy, Islam, Humayera, Singh, Nitesh, Jampani, Yaswitha, Kotapati, Teja Venkat Pavan, Tautam, Preethi Aishwarya, Rana, Md Kamruz Zaman, Mandhadi, Vasanthi, Sharma, Vishakha, Barnes, Michael, Hammer, Richard D., Mosa, Abu Saleh Mohammad
Format Journal Article
Language English
Published MDPI AG 01.10.2022
Abstract The de-identification of clinical reports is essential to protect patient confidentiality. Natural-language-processing-based named entity recognition (NER) is a widely used technique for automatic clinical de-identification. The performance of such a machine learning model relies largely on the proper selection of features. The objective of this study was to investigate the utility of various features in a conditional-random-field (CRF)-based NER model. Natural language processing (NLP) toolkits were used to annotate the protected health information (PHI) in a total of 10,239 radiology reports that were divided into seven types. Multiple features were extracted by the toolkit, and NER models were built using these features and their combinations. A total of 10 features were extracted, and the performance of the models was evaluated based on their precision, recall, and F1-score. The best-performing features were n-gram, prefix-suffix, word embedding, and word shape. These features outperformed the others across all types of reports. The dataset was large in volume and divided into multiple types of reports. Such a diverse dataset ensured that the results did not depend on a small number of structured texts from which a machine learning model can easily learn the features. The manual de-identification of large-scale clinical reports is impractical. This study helps to identify the best-performing features for building an NER model for automatic de-identification from the wide array of features mentioned in the literature.
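As an illustration only, three of the feature families named in the abstract (n-gram, prefix-suffix, and word shape) and the three evaluation metrics can be sketched in plain Python. This is a minimal sketch and not the authors' implementation: the function names, the affix length of 3, the boundary markers, the sample tokens and labels, and the choice of token-level (rather than entity-level) scoring are all assumptions made for the example. Word embeddings are omitted because they require a trained embedding model.

```python
import re
from typing import Dict, List, Sequence, Tuple


def word_shape(token: str) -> str:
    """Collapse characters into classes: uppercase -> X, lowercase -> x, digit -> d."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    return re.sub(r"[0-9]", "d", shape)


def token_features(tokens: Sequence[str], i: int, affix_len: int = 3) -> Dict[str, str]:
    """Build a feature dictionary for tokens[i] (hypothetical feature names)."""
    tok = tokens[i]
    lower = tok.lower()
    # Boundary markers are an assumption of this sketch.
    prev_tok = tokens[i - 1].lower() if i > 0 else "<BOS>"
    next_tok = tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>"
    return {
        "unigram": lower,
        "bigram_prev": prev_tok + " " + lower,   # n-gram context to the left
        "bigram_next": lower + " " + next_tok,   # n-gram context to the right
        "prefix": lower[:affix_len],
        "suffix": lower[-affix_len:],
        "shape": word_shape(tok),
    }


def precision_recall_f1(
    gold: Sequence[str], pred: Sequence[str], outside: str = "O"
) -> Tuple[float, float, float]:
    """Token-level precision, recall, and F1 over non-outside labels."""
    tp = sum(1 for g, p in zip(gold, pred) if p != outside and p == g)
    pred_pos = sum(1 for p in pred if p != outside)
    gold_pos = sum(1 for g in gold if g != outside)
    precision = tp / pred_pos if pred_pos else 0.0
    recall = tp / gold_pos if gold_pos else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up example sentence and labels, not taken from the study's data.
    tokens: List[str] = ["Dr.", "Smith", "seen", "01/02/2020"]
    for idx in range(len(tokens)):
        print(token_features(tokens, idx))
    gold = ["O", "NAME", "O", "DATE"]
    pred = ["NAME", "NAME", "O", "O"]
    print(precision_recall_f1(gold, pred))
```

Feature dictionaries of this kind are the usual input format for CRF sequence labelers, with one dictionary per token and one list of dictionaries per report sentence.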
Author_xml – sequence: 1
  givenname: Tanmoy
  orcidid: 0000-0002-0022-742X
  surname: Paul
  fullname: Paul, Tanmoy
– sequence: 2
  givenname: Humayera
  orcidid: 0000-0003-4915-4062
  surname: Islam
  fullname: Islam, Humayera
– sequence: 3
  givenname: Nitesh
  surname: Singh
  fullname: Singh, Nitesh
– sequence: 4
  givenname: Yaswitha
  surname: Jampani
  fullname: Jampani, Yaswitha
– sequence: 5
  givenname: Teja Venkat Pavan
  surname: Kotapati
  fullname: Kotapati, Teja Venkat Pavan
– sequence: 6
  givenname: Preethi Aishwarya
  surname: Tautam
  fullname: Tautam, Preethi Aishwarya
– sequence: 7
  givenname: Md Kamruz Zaman
  surname: Rana
  fullname: Rana, Md Kamruz Zaman
– sequence: 8
  givenname: Vasanthi
  surname: Mandhadi
  fullname: Mandhadi, Vasanthi
– sequence: 9
  givenname: Vishakha
  surname: Sharma
  fullname: Sharma, Vishakha
– sequence: 10
  givenname: Michael
  surname: Barnes
  fullname: Barnes, Michael
– sequence: 11
  givenname: Richard D.
  orcidid: 0000-0002-7173-9414
  surname: Hammer
  fullname: Hammer, Richard D.
– sequence: 12
  givenname: Abu Saleh Mohammad
  surname: Mosa
  fullname: Mosa, Abu Saleh Mohammad
ContentType Journal Article
DOI 10.3390/app12199976
Discipline Engineering
Sciences (General)
EISSN 2076-3417
ISSN 2076-3417
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
Issue 19
Language English
License https://creativecommons.org/licenses/by/4.0
ORCID 0000-0003-4915-4062
0000-0002-7173-9414
0000-0002-0022-742X
OpenAccessLink https://doaj.org/article/e0a2cb981dc5479694e7a9dccdcb2ccd
PublicationCentury 2000
PublicationDate 2022-10-01
PublicationDateYYYYMMDD 2022-10-01
PublicationDate_xml – month: 10
  year: 2022
  text: 2022-10-01
  day: 01
PublicationDecade 2020
PublicationTitle Applied sciences
PublicationYear 2022
Publisher MDPI AG
Publisher_xml – name: MDPI AG
StartPage 9976
SubjectTerms conditional random field (CRF)
de-identification
named entity recognition (NER)
natural language processing (NLP)
protected health information
Title Utility of Features in a Natural-Language-Processing-Based Clinical De-Identification Model Using Radiology Reports for Advanced NSCLC Patients
URI https://doaj.org/article/e0a2cb981dc5479694e7a9dccdcb2ccd
Volume 12