Utility of Features in a Natural-Language-Processing-Based Clinical De-Identification Model Using Radiology Reports for Advanced NSCLC Patients
Published in | Applied sciences Vol. 12; no. 19; p. 9976 |
---|---|
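Main Authors | Paul, Tanmoy; Islam, Humayera; Singh, Nitesh; Jampani, Yaswitha; Kotapati, Teja Venkat Pavan; Tautam, Preethi Aishwarya; Rana, Md Kamruz Zaman; Mandhadi, Vasanthi; Sharma, Vishakha; Barnes, Michael; Hammer, Richard D.; Mosa, Abu Saleh Mohammad |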
Format | Journal Article |
Language | English |
Published | MDPI AG, 01.10.2022 |
Abstract | De-identification of clinical reports is essential to protect patient confidentiality. Natural-language-processing-based named entity recognition (NER) is a widely used technique for automatic clinical de-identification. The performance of such a machine learning model depends largely on proper feature selection. The objective of this study was to investigate the utility of various features in a conditional-random-field (CRF)-based NER model. Natural language processing (NLP) toolkits were used to annotate the protected health information (PHI) in a total of 10,239 radiology reports, which were divided into seven types. A total of 10 features were extracted by the toolkit, and NER models were built using these features and their combinations. Model performance was evaluated by precision, recall, and F1-score. The best-performing features were n-gram, prefix-suffix, word embedding, and word shape; they outperformed the other features across all report types. The dataset was large and spanned multiple report types. This diversity ensured that the results did not depend on a small number of structured texts from which a machine learning model can easily learn the features. Manual de-identification of large-scale clinical reports is impractical. Among the wide array of features mentioned in the literature, this study helps to identify the best-performing ones for building an NER model for automatic de-identification. |
Author | Paul, Tanmoy (ORCID: 0000-0002-0022-742X); Islam, Humayera (ORCID: 0000-0003-4915-4062); Singh, Nitesh; Jampani, Yaswitha; Kotapati, Teja Venkat Pavan; Tautam, Preethi Aishwarya; Rana, Md Kamruz Zaman; Mandhadi, Vasanthi; Sharma, Vishakha; Barnes, Michael; Hammer, Richard D. (ORCID: 0000-0002-7173-9414); Mosa, Abu Saleh Mohammad |
DOI | 10.3390/app12199976 |
Discipline | Engineering Sciences (General) |
EISSN | 2076-3417 |
ISSN | 2076-3417 |
IsDoiOpenAccess | true |
IsOpenAccess | true |
IsPeerReviewed | true |
IsScholarly | true |
Issue | 19 |
License | https://creativecommons.org/licenses/by/4.0 |
OpenAccessLink | https://doaj.org/article/e0a2cb981dc5479694e7a9dccdcb2ccd |
PublicationDate | 2022-10-01 |
PublicationTitle | Applied sciences |
PublicationYear | 2022 |
Publisher | MDPI AG |
StartPage | 9976 |
SubjectTerms | conditional random field (CRF); de-identification; named entity recognition (NER); natural language processing (NLP); protected health information |
Volume | 12 |
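The abstract names word shape, prefix-suffix, and n-gram among the best-performing features and reports precision, recall, and F1-score. The sketch below illustrates how such token-level features and metrics are commonly computed. It is not the authors' implementation: this record does not give their feature definitions or name the toolkit, so every function name, feature key, the affix length of 3, and the `<BOS>`/`<EOS>` markers are hypothetical. Word-embedding features are omitted because they require pretrained vectors.

```python
import re


def word_shape(token):
    """Collapse characters into classes: uppercase -> X, lowercase -> x, digit -> d."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    # Digits are mapped last so the inserted "d" is not rewritten to "x".
    return re.sub(r"[0-9]", "d", shape)


def token_features(tokens, i, affix_len=3):
    """Build a feature dict for tokens[i] (hypothetical feature names).

    Dicts of this kind are a common input format for CRF libraries.
    """
    tok = tokens[i]
    prev_tok = tokens[i - 1].lower() if i > 0 else "<BOS>"
    next_tok = tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>"
    return {
        "word.lower": tok.lower(),
        "shape": word_shape(tok),            # word-shape feature
        "prefix": tok[:affix_len].lower(),   # prefix feature
        "suffix": tok[-affix_len:].lower(),  # suffix feature
        "prev": prev_tok,                    # neighbouring-token context
        "next": next_tok,
        "bigram.prev": prev_tok + " " + tok.lower(),  # word bigram
    }


def precision_recall_f1(gold, predicted):
    """Entity-level scores over sets of (start, end, label) tuples, exact match."""
    gold, predicted = set(gold), set(predicted)
    true_pos = len(gold & predicted)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Invented example sentence; not taken from the study's data.
    tokens = ["Seen", "by", "Dr.", "Smith", "on", "01/02/2020"]
    print(token_features(tokens, 3))
    print(precision_recall_f1({(3, 4, "NAME"), (5, 6, "DATE")},
                              {(3, 4, "NAME"), (0, 1, "DATE")}))
```

Word shape helps with PHI such as dates and identifiers because tokens with different characters share a pattern (for example, any date written as two digits, slash, two digits, slash, four digits maps to the same shape). Exact-match scoring is only one convention; token-level or partial-match scoring would give different numbers.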