Exposing Vulnerabilities in Clinical LLMs Through Data Poisoning Attacks: Case Study in Breast Cancer

Training Large Language Models (LLMs) with billions of parameters on a dataset and publishing the model for public access is the current standard practice. Despite their transformative impact on natural language processing (NLP), public LLMs present notable vulnerabilities given the source of traini...

Full description

Saved in:
Bibliographic Details
Published inAMIA ... Annual Symposium proceedings Vol. 2024; p. 339
Main Authors Das, Avisha, Tariq, Amara, Batalini, Felipe, Dhara, Boddhisattwa, Banerjee, Imon
Format Journal Article
LanguageEnglish
Published United States 2024
Subjects
Online AccessGet full text
ISSN1942-597X
1559-4076

Cover

Loading…
Abstract Training Large Language Models (LLMs) with billions of parameters on a dataset and publishing the model for public access is the current standard practice. Despite their transformative impact on natural language processing (NLP), public LLMs present notable vulnerabilities given the source of training data is often web-based or crowdsourced, and hence can be manipulated by perpetrators. We delve into the vulnerabilities of clinical LLMs, particularly BioGPT which is trained on publicly available biomedical literature and clinical notes from MIMIC-III, in the realm of data poisoning attacks. Exploring susceptibility to data poisoning-based attacks on de-identified breast cancer clinical notes, our approach is the first one to assess the extent of such attacks and our findings reveal successful manipulation of LLM outputs. Through this work, we emphasize on the urgency of comprehending these vulnerabilities in LLMs, and encourage the mindful and responsible usage of LLMs in the clinical domain.
AbstractList Training Large Language Models (LLMs) with billions of parameters on a dataset and publishing the model for public access is the current standard practice. Despite their transformative impact on natural language processing (NLP), public LLMs present notable vulnerabilities given the source of training data is often web-based or crowdsourced, and hence can be manipulated by perpetrators. We delve into the vulnerabilities of clinical LLMs, particularly BioGPT which is trained on publicly available biomedical literature and clinical notes from MIMIC-III, in the realm of data poisoning attacks. Exploring susceptibility to data poisoning-based attacks on de-identified breast cancer clinical notes, our approach is the first one to assess the extent of such attacks and our findings reveal successful manipulation of LLM outputs. Through this work, we emphasize on the urgency of comprehending these vulnerabilities in LLMs, and encourage the mindful and responsible usage of LLMs in the clinical domain.
Training Large Language Models (LLMs) with billions of parameters on a dataset and publishing the model for public access is the current standard practice. Despite their transformative impact on natural language processing (NLP), public LLMs present notable vulnerabilities given the source of training data is often web-based or crowdsourced, and hence can be manipulated by perpetrators. We delve into the vulnerabilities of clinical LLMs, particularly BioGPT which is trained on publicly available biomedical literature and clinical notes from MIMIC-III, in the realm of data poisoning attacks. Exploring susceptibility to data poisoning-based attacks on de-identified breast cancer clinical notes, our approach is the first one to assess the extent of such attacks and our findings reveal successful manipulation of LLM outputs. Through this work, we emphasize on the urgency of comprehending these vulnerabilities in LLMs, and encourage the mindful and responsible usage of LLMs in the clinical domain.Training Large Language Models (LLMs) with billions of parameters on a dataset and publishing the model for public access is the current standard practice. Despite their transformative impact on natural language processing (NLP), public LLMs present notable vulnerabilities given the source of training data is often web-based or crowdsourced, and hence can be manipulated by perpetrators. We delve into the vulnerabilities of clinical LLMs, particularly BioGPT which is trained on publicly available biomedical literature and clinical notes from MIMIC-III, in the realm of data poisoning attacks. Exploring susceptibility to data poisoning-based attacks on de-identified breast cancer clinical notes, our approach is the first one to assess the extent of such attacks and our findings reveal successful manipulation of LLM outputs. Through this work, we emphasize on the urgency of comprehending these vulnerabilities in LLMs, and encourage the mindful and responsible usage of LLMs in the clinical domain.
Author Banerjee, Imon
Dhara, Boddhisattwa
Tariq, Amara
Batalini, Felipe
Das, Avisha
Author_xml – sequence: 1
  givenname: Avisha
  surname: Das
  fullname: Das, Avisha
  organization: Arizona Advanced AI & Innovation (A3I) Hub, Mayo Clinic Arizona
– sequence: 2
  givenname: Amara
  surname: Tariq
  fullname: Tariq, Amara
  organization: Arizona Advanced AI & Innovation (A3I) Hub, Mayo Clinic Arizona
– sequence: 3
  givenname: Felipe
  surname: Batalini
  fullname: Batalini, Felipe
  organization: Department of Oncology, Mayo Clinic Arizona
– sequence: 4
  givenname: Boddhisattwa
  surname: Dhara
  fullname: Dhara, Boddhisattwa
  organization: BITS Pilani (Hyderabad), India
– sequence: 5
  givenname: Imon
  surname: Banerjee
  fullname: Banerjee, Imon
  organization: School of Computing and Augmented Intelligence, Arizona State University
BackLink https://www.ncbi.nlm.nih.gov/pubmed/40417494$$D View this record in MEDLINE/PubMed
BookMark eNo1kF9LwzAUxYNM3B_9CpJHXwppmzStb7NuKlQUHOJbuU1vt2iX1iQF9-3tcD7dy-X8LuecOZmYzuAZmYVCZAFnMpmMe8ajQGTyY0rmzn0yxqVIkwsy5YyHkmd8RnD103dOmy19H1qDFirdaq_RUW1o3mqjFbS0KJ4d3exsN2x39B480NdOu84cuaX3oL7cLc3BIX3zQ304sncWwfnxaBTaS3LeQOvw6jQXZLNebfLHoHh5eMqXRdCHUepH2yoVNU8xbBoGwCVntQqjSCUVAjQKkkpKFjdMySSKUTYNpIKLqBYAaZzFC3Lz97a33feAzpd77RS2LRjsBlfGERv5MTwbpdcn6VDtsS57q_dgD-V_M_EvSaFitg
ContentType Journal Article
Copyright 2024 AMIA - All rights reserved.
Copyright_xml – notice: 2024 AMIA - All rights reserved.
DBID CGR
CUY
CVF
ECM
EIF
NPM
7X8
DatabaseName Medline
MEDLINE
MEDLINE (Ovid)
MEDLINE
MEDLINE
PubMed
MEDLINE - Academic
DatabaseTitle MEDLINE
Medline Complete
MEDLINE with Full Text
PubMed
MEDLINE (Ovid)
MEDLINE - Academic
DatabaseTitleList MEDLINE
MEDLINE - Academic
Database_xml – sequence: 1
  dbid: NPM
  name: PubMed
  url: https://proxy.k.utb.cz/login?url=http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed
  sourceTypes: Index Database
– sequence: 2
  dbid: EIF
  name: MEDLINE
  url: https://proxy.k.utb.cz/login?url=https://www.webofscience.com/wos/medline/basic-search
  sourceTypes: Index Database
DeliveryMethod fulltext_linktorsrc
Discipline Medicine
EISSN 1559-4076
ExternalDocumentID 40417494
Genre Journal Article
GroupedDBID 2WC
53G
ADBBV
ALMA_UNASSIGNED_HOLDINGS
BAWUL
CGR
CUY
CVF
DIK
E3Z
ECM
EIF
GX1
HYE
NPM
OK1
RPM
WOQ
7X8
ID FETCH-LOGICAL-p128t-40c85d48e1ff0aa4740dc122c6beaafca6b7703f0c7623e7ffa85452d5aa8393
ISSN 1942-597X
IngestDate Mon May 26 17:04:48 EDT 2025
Wed Jun 04 01:40:05 EDT 2025
IsPeerReviewed true
IsScholarly true
Language English
License 2024 AMIA - All rights reserved.
LinkModel OpenURL
MergedId FETCHMERGED-LOGICAL-p128t-40c85d48e1ff0aa4740dc122c6beaafca6b7703f0c7623e7ffa85452d5aa8393
Notes ObjectType-Article-1
SourceType-Scholarly Journals-1
ObjectType-Feature-2
content type line 23
PMID 40417494
PQID 3207704040
PQPubID 23479
ParticipantIDs proquest_miscellaneous_3207704040
pubmed_primary_40417494
PublicationCentury 2000
PublicationDate 2024-00-00
20240101
PublicationDateYYYYMMDD 2024-01-01
PublicationDate_xml – year: 2024
  text: 2024-00-00
PublicationDecade 2020
PublicationPlace United States
PublicationPlace_xml – name: United States
PublicationTitle AMIA ... Annual Symposium proceedings
PublicationTitleAlternate AMIA Annu Symp Proc
PublicationYear 2024
References 38562849 - medRxiv. 2024 Mar 21:2024.03.20.24304627. doi: 10.1101/2024.03.20.24304627.
References_xml – reference: 38562849 - medRxiv. 2024 Mar 21:2024.03.20.24304627. doi: 10.1101/2024.03.20.24304627.
SSID ssj0047586
Score 2.322088
Snippet Training Large Language Models (LLMs) with billions of parameters on a dataset and publishing the model for public access is the current standard practice....
SourceID proquest
pubmed
SourceType Aggregation Database
Index Database
StartPage 339
SubjectTerms Breast Neoplasms
Computer Security
Female
Humans
Natural Language Processing
Title Exposing Vulnerabilities in Clinical LLMs Through Data Poisoning Attacks: Case Study in Breast Cancer
URI https://www.ncbi.nlm.nih.gov/pubmed/40417494
https://www.proquest.com/docview/3207704040
Volume 2024
hasFullText 1
inHoldings 1
isFullTextHit
isPrint
link http://utb.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMwnZ1ba9swFIBF6cPYy1jXXdqtRYO9mRgnlnzZW5q2tKMpg2Ujb0G2JGZonCxVusuv7zm6OGGssO3FGNlJjD756JyTcyHknSi1kmkhe7yf6x7LqxLeuVrBmeZ9kXHNJboGxtfZxWf2YcqnocW9zy4xVVz_-mNeyf9QhTHgilmy_0C2-1IYgHPgC0cgDMe_Ynz2A2OuwNb_sr7B6tE20LWxIVbRKKQ8Xl2Nb6OJb8dzKoyIPi4whMj6Q4zBHHt0C4xgO7NRhTYR8ARj1Q0MwppYbSuww_HlMIrjOPKF-T_9nOMzrOfRZivc6lXveh3fYYvnjY9g1Xyzw3Ox2rgD0I3U2P5S0bm6aZbdgjvFktJ2GS6k_IrRR-a72HZWuOzoWHnRykuwVl23lyB7u3uc-ExdYaMtdMu5ZccSBqZTyTa7VhdLGC5hAQOQ75jvPe2ifBiYQ9ilKtz0sB1h9YnJU_LEGwJ06KjukR3VPiOPxj7UYZ-oAJf-Bpc2LQ1wKcKlHi5FuLSDSz3c9xTRUosWP-vQUof2OZmcn01GFz3fE6O3BE3CwATWBZesUH2tEyFYzhJZ9weDOquUELoWWZWDENdJDbtcqnKtRYFt5CUXAnTh9AXZbRetekVoXmihWSFLXpYs06VQLM1Qn1HApJLFAXkbpmoGIgf_RxKtWqxvZ-kggd-AKU0OyEs3h7Olq40yCxN9-OCV1-QxUndurDdk16zW6ggUO1MdW3T35YBUWg
linkProvider Geneva Foundation for Medical Education and Research
openUrl ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&rft.genre=article&rft.atitle=Exposing+Vulnerabilities+in+Clinical+LLMs+Through+Data+Poisoning+Attacks%3A+Case+Study+in+Breast+Cancer&rft.jtitle=AMIA+...+Annual+Symposium+proceedings&rft.au=Das%2C+Avisha&rft.au=Tariq%2C+Amara&rft.au=Batalini%2C+Felipe&rft.au=Dhara%2C+Boddhisattwa&rft.date=2024&rft.eissn=1559-4076&rft.volume=2024&rft.spage=339&rft_id=info%3Apmid%2F40417494&rft.externalDocID=40417494
thumbnail_l http://covers-cdn.summon.serialssolutions.com/index.aspx?isbn=/lc.gif&issn=1942-597X&client=summon
thumbnail_m http://covers-cdn.summon.serialssolutions.com/index.aspx?isbn=/mc.gif&issn=1942-597X&client=summon
thumbnail_s http://covers-cdn.summon.serialssolutions.com/index.aspx?isbn=/sc.gif&issn=1942-597X&client=summon