Differentially private de-identifying textual medical document is compliant with challenging NLP analyses: Example of privacy-preserving ICD-10 code association

Bibliographic Details
Published in: Intelligent Systems with Applications, Vol. 23, p. 200416
Main Authors: Tchouka, Yakini; Couchot, Jean-François; Laiymani, David; Selles, Philippe; Rahmani, Azzedine
Format: Journal Article
Language: English
Published: Elsevier Ltd, 01.09.2024
Summary: Medical research plays a crucial role within scientific research. Technological advancements, especially those related to the rise of machine learning, pave the way for the exploration of medical issues that were once beyond reach. Unstructured textual data, such as correspondence between doctors and operative reports, often serve as a starting point for many medical applications. However, for privacy reasons, researchers do not legally have the right to access these documents as long as they contain sensitive data, as defined by regulations such as GDPR (General Data Protection Regulation) or HIPAA (Health Insurance Portability and Accountability Act). De-identification, meaning the detection and then removal or substitution of all sensitive information, is therefore a necessary step to facilitate the sharing of these data between the medical field and research. Over the past decade, various approaches have been proposed to de-identify medical textual data. However, while entity detection is a well-known task in natural language processing, it presents specific challenges in the medical context. Moreover, existing substitution methods proposed in the literature often pay little attention to the medical relevance of de-identified data or are not very resilient to attacks. This paper addresses these challenges. Firstly, an efficient system was implemented for detecting sensitive entities in French medical data and then accurately substituting them. Secondly, robust strategies were provided for generating substitutes that incorporate the medical utility of the data, thereby minimizing the difference in utility between the original and de-identified data, and that mathematically ensure privacy protection. Thirdly, the utility of the de-identification system was evaluated in the context of ICD-10 code association. Finally, various systems developed to tackle ICD-10 code association were presented, along with a state-of-the-art model for French.
• We developed a French NER model for sensitive entities, based on the Flaubert Transformer.
• Trained on a constructed dataset, it represents the state of the art for the NER task in French within the context of de-identification with HIPAA attributes.
• Substitutes for sensitive attributes are generated in the differential privacy context.
• The surrogate generation approaches are available on GitHub: https://github.com/healthinf/Surrogate-generation-Strategies-in-De-identification
• These contributions enabled the development of an incremental approach for dataset construction.
• Various architectures to tackle issues of ICD coding have been developed.
• An open-source implementation of this system is available on GitHub: https://github.com/mlfiab/icd10-french
• Experimentally, our de-identification approach helps reduce the loss of utility (6.8%) compared to a traditional de-identification method (12%, e.g., replacing sensitive attributes with their labels).
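The two substitution styles contrasted in the highlights can be illustrated with a minimal Python sketch. It is not taken from the paper or its repositories: the function names, the entity format `(start, end, label)`, the age domain, and the utility function are all hypothetical, and the exponential mechanism is used here only as one standard way to pick a surrogate under differential privacy. The first function mimics the traditional baseline (replacing each detected entity with its label); the second samples a surrogate with probability weighted by a utility score.

```python
import math
import random


def replace_with_labels(text, entities):
    """Baseline: replace each detected span with its label, e.g. <NAME>.

    `entities` is a list of non-overlapping (start, end, label) tuples
    (a hypothetical format, assumed to come from an NER model).
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(entities):
        pieces.append(text[cursor:start])
        pieces.append(f"<{label}>")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


def exponential_mechanism(candidates, utility, epsilon, sensitivity, rng=None):
    """Sample one candidate with probability proportional to
    exp(epsilon * utility / (2 * sensitivity)).

    `sensitivity` must bound how much the utility of any candidate can
    change when the true (private) value changes.
    """
    rng = rng or random.Random()
    scores = [utility(c) for c in candidates]
    top = max(scores)  # shift scores for numerical stability; ratios are unchanged
    weights = [math.exp(epsilon * (s - top) / (2.0 * sensitivity)) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]


# Illustrative use: substitute an age while favouring medically close values.
true_age = 67
age_domain = list(range(60, 75))       # assumed public domain of possible ages
age_sensitivity = max(age_domain) - min(age_domain)  # utility varies by at most 14


def closeness(candidate):
    # Higher utility for surrogates near the true age.
    return -abs(candidate - true_age)


surrogate_age = exponential_mechanism(
    age_domain, closeness, epsilon=1.0, sensitivity=age_sensitivity
)
```

A smaller `epsilon` makes the surrogate closer to a uniform draw from the domain (more privacy, less utility), while a larger `epsilon` concentrates the draw near the true value; this trade-off is the kind of utility loss the last highlight quantifies.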
ISSN: 2667-3053
DOI: 10.1016/j.iswa.2024.200416