Exposing Vulnerabilities in Clinical LLMs Through Data Poisoning Attacks: Case Study in Breast Cancer

Training Large Language Models (LLMs) with billions of parameters on a dataset and publishing the model for public access is the current standard practice. Despite their transformative impact on natural language processing (NLP), public LLMs present notable vulnerabilities given the source of traini...

Full description

Saved in:

Bibliographic Details
Published in	AMIA ... Annual Symposium proceedings Vol. 2024; p. 339
Main Authors	Das, Avisha, Tariq, Amara, Batalini, Felipe, Dhara, Boddhisattwa, Banerjee, Imon
Format	Journal Article
Language	English
Published	United States 2024
Subjects	Breast Neoplasms Computer Security Female Humans Natural Language Processing
Online Access	Get full text
ISSN	1942-597X 1559-4076

Cover

Loading…

Abstract	Training Large Language Models (LLMs) with billions of parameters on a dataset and publishing the model for public access is the current standard practice. Despite their transformative impact on natural language processing (NLP), public LLMs present notable vulnerabilities given the source of training data is often web-based or crowdsourced, and hence can be manipulated by perpetrators. We delve into the vulnerabilities of clinical LLMs, particularly BioGPT which is trained on publicly available biomedical literature and clinical notes from MIMIC-III, in the realm of data poisoning attacks. Exploring susceptibility to data poisoning-based attacks on de-identified breast cancer clinical notes, our approach is the first one to assess the extent of such attacks and our findings reveal successful manipulation of LLM outputs. Through this work, we emphasize on the urgency of comprehending these vulnerabilities in LLMs, and encourage the mindful and responsible usage of LLMs in the clinical domain.
AbstractList	Training Large Language Models (LLMs) with billions of parameters on a dataset and publishing the model for public access is the current standard practice. Despite their transformative impact on natural language processing (NLP), public LLMs present notable vulnerabilities given the source of training data is often web-based or crowdsourced, and hence can be manipulated by perpetrators. We delve into the vulnerabilities of clinical LLMs, particularly BioGPT which is trained on publicly available biomedical literature and clinical notes from MIMIC-III, in the realm of data poisoning attacks. Exploring susceptibility to data poisoning-based attacks on de-identified breast cancer clinical notes, our approach is the first one to assess the extent of such attacks and our findings reveal successful manipulation of LLM outputs. Through this work, we emphasize on the urgency of comprehending these vulnerabilities in LLMs, and encourage the mindful and responsible usage of LLMs in the clinical domain. Training Large Language Models (LLMs) with billions of parameters on a dataset and publishing the model for public access is the current standard practice. Despite their transformative impact on natural language processing (NLP), public LLMs present notable vulnerabilities given the source of training data is often web-based or crowdsourced, and hence can be manipulated by perpetrators. We delve into the vulnerabilities of clinical LLMs, particularly BioGPT which is trained on publicly available biomedical literature and clinical notes from MIMIC-III, in the realm of data poisoning attacks. Exploring susceptibility to data poisoning-based attacks on de-identified breast cancer clinical notes, our approach is the first one to assess the extent of such attacks and our findings reveal successful manipulation of LLM outputs. Through this work, we emphasize on the urgency of comprehending these vulnerabilities in LLMs, and encourage the mindful and responsible usage of LLMs in the clinical domain.Training Large Language Models (LLMs) with billions of parameters on a dataset and publishing the model for public access is the current standard practice. Despite their transformative impact on natural language processing (NLP), public LLMs present notable vulnerabilities given the source of training data is often web-based or crowdsourced, and hence can be manipulated by perpetrators. We delve into the vulnerabilities of clinical LLMs, particularly BioGPT which is trained on publicly available biomedical literature and clinical notes from MIMIC-III, in the realm of data poisoning attacks. Exploring susceptibility to data poisoning-based attacks on de-identified breast cancer clinical notes, our approach is the first one to assess the extent of such attacks and our findings reveal successful manipulation of LLM outputs. Through this work, we emphasize on the urgency of comprehending these vulnerabilities in LLMs, and encourage the mindful and responsible usage of LLMs in the clinical domain.
Author	Banerjee, Imon Dhara, Boddhisattwa Tariq, Amara Batalini, Felipe Das, Avisha
Author_xml	– sequence: 1 givenname: Avisha surname: Das fullname: Das, Avisha organization: Arizona Advanced AI & Innovation (A3I) Hub, Mayo Clinic Arizona – sequence: 2 givenname: Amara surname: Tariq fullname: Tariq, Amara organization: Arizona Advanced AI & Innovation (A3I) Hub, Mayo Clinic Arizona – sequence: 3 givenname: Felipe surname: Batalini fullname: Batalini, Felipe organization: Department of Oncology, Mayo Clinic Arizona – sequence: 4 givenname: Boddhisattwa surname: Dhara fullname: Dhara, Boddhisattwa organization: BITS Pilani (Hyderabad), India – sequence: 5 givenname: Imon surname: Banerjee fullname: Banerjee, Imon organization: School of Computing and Augmented Intelligence, Arizona State University
BackLink	https://www.ncbi.nlm.nih.gov/pubmed/40417494$$D View this record in MEDLINE/PubMed
BookMark	eNo1kF9LwzAUxYNM3B_9CpJHXwppmzStb7NuKlQUHOJbuU1vt2iX1iQF9-3tcD7dy-X8LuecOZmYzuAZmYVCZAFnMpmMe8ajQGTyY0rmzn0yxqVIkwsy5YyHkmd8RnD103dOmy19H1qDFirdaq_RUW1o3mqjFbS0KJ4d3exsN2x39B480NdOu84cuaX3oL7cLc3BIX3zQ304sncWwfnxaBTaS3LeQOvw6jQXZLNebfLHoHh5eMqXRdCHUepH2yoVNU8xbBoGwCVntQqjSCUVAjQKkkpKFjdMySSKUTYNpIKLqBYAaZzFC3Lz97a33feAzpd77RS2LRjsBlfGERv5MTwbpdcn6VDtsS57q_dgD-V_M_EvSaFitg
ContentType	Journal Article
Copyright	2024 AMIA - All rights reserved.
Copyright_xml	– notice: 2024 AMIA - All rights reserved.
DBID	CGR CUY CVF ECM EIF NPM 7X8
DatabaseName	Medline MEDLINE MEDLINE (Ovid) MEDLINE MEDLINE PubMed MEDLINE - Academic
DatabaseTitle	MEDLINE Medline Complete MEDLINE with Full Text PubMed MEDLINE (Ovid) MEDLINE - Academic
DatabaseTitleList	MEDLINE MEDLINE - Academic
Database_xml	– sequence: 1 dbid: NPM name: PubMed url: https://proxy.k.utb.cz/login?url=http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed sourceTypes: Index Database – sequence: 2 dbid: EIF name: MEDLINE url: https://proxy.k.utb.cz/login?url=https://www.webofscience.com/wos/medline/basic-search sourceTypes: Index Database
DeliveryMethod	fulltext_linktorsrc
Discipline	Medicine
EISSN	1559-4076
ExternalDocumentID	40417494
Genre	Journal Article
GroupedDBID	2WC 53G ADBBV ALMA_UNASSIGNED_HOLDINGS BAWUL CGR CUY CVF DIK E3Z ECM EIF GX1 HYE NPM OK1 RPM WOQ 7X8
ID	FETCH-LOGICAL-p128t-40c85d48e1ff0aa4740dc122c6beaafca6b7703f0c7623e7ffa85452d5aa8393
ISSN	1942-597X
IngestDate	Mon May 26 17:04:48 EDT 2025 Wed Jun 04 01:40:05 EDT 2025
IsPeerReviewed	true
IsScholarly	true
Language	English
License	2024 AMIA - All rights reserved.
LinkModel	OpenURL
MergedId	FETCHMERGED-LOGICAL-p128t-40c85d48e1ff0aa4740dc122c6beaafca6b7703f0c7623e7ffa85452d5aa8393
Notes	ObjectType-Article-1 SourceType-Scholarly Journals-1 ObjectType-Feature-2 content type line 23
PMID	40417494
PQID	3207704040
PQPubID	23479
ParticipantIDs	proquest_miscellaneous_3207704040 pubmed_primary_40417494
PublicationCentury	2000
PublicationDate	2024-00-00 20240101
PublicationDateYYYYMMDD	2024-01-01
PublicationDate_xml	– year: 2024 text: 2024-00-00
PublicationDecade	2020
PublicationPlace	United States
PublicationPlace_xml	– name: United States
PublicationTitle	AMIA ... Annual Symposium proceedings
PublicationTitleAlternate	AMIA Annu Symp Proc
PublicationYear	2024
References	38562849 - medRxiv. 2024 Mar 21:2024.03.20.24304627. doi: 10.1101/2024.03.20.24304627.
References_xml	– reference: 38562849 - medRxiv. 2024 Mar 21:2024.03.20.24304627. doi: 10.1101/2024.03.20.24304627.
SSID	ssj0047586
Score	2.322088
Snippet	Training Large Language Models (LLMs) with billions of parameters on a dataset and publishing the model for public access is the current standard practice....
SourceID	proquest pubmed
SourceType	Aggregation Database Index Database
StartPage	339
SubjectTerms	Breast Neoplasms Computer Security Female Humans Natural Language Processing
Title	Exposing Vulnerabilities in Clinical LLMs Through Data Poisoning Attacks: Case Study in Breast Cancer
URI	https://www.ncbi.nlm.nih.gov/pubmed/40417494 https://www.proquest.com/docview/3207704040
Volume	2024
hasFullText	1
inHoldings	1
isFullTextHit
isPrint
link	http://utb.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMwnZ1ba9swFIBF6cPYy1jXXdqtRYO9mRgnlnzZW5q2tKMpg2Ujb0G2JGZonCxVusuv7zm6OGGssO3FGNlJjD756JyTcyHknSi1kmkhe7yf6x7LqxLeuVrBmeZ9kXHNJboGxtfZxWf2YcqnocW9zy4xVVz_-mNeyf9QhTHgilmy_0C2-1IYgHPgC0cgDMe_Ynz2A2OuwNb_sr7B6tE20LWxIVbRKKQ8Xl2Nb6OJb8dzKoyIPi4whMj6Q4zBHHt0C4xgO7NRhTYR8ARj1Q0MwppYbSuww_HlMIrjOPKF-T_9nOMzrOfRZivc6lXveh3fYYvnjY9g1Xyzw3Ox2rgD0I3U2P5S0bm6aZbdgjvFktJ2GS6k_IrRR-a72HZWuOzoWHnRykuwVl23lyB7u3uc-ExdYaMtdMu5ZccSBqZTyTa7VhdLGC5hAQOQ75jvPe2ifBiYQ9ilKtz0sB1h9YnJU_LEGwJ06KjukR3VPiOPxj7UYZ-oAJf-Bpc2LQ1wKcKlHi5FuLSDSz3c9xTRUosWP-vQUof2OZmcn01GFz3fE6O3BE3CwATWBZesUH2tEyFYzhJZ9weDOquUELoWWZWDENdJDbtcqnKtRYFt5CUXAnTh9AXZbRetekVoXmihWSFLXpYs06VQLM1Qn1HApJLFAXkbpmoGIgf_RxKtWqxvZ-kggd-AKU0OyEs3h7Olq40yCxN9-OCV1-QxUndurDdk16zW6ggUO1MdW3T35YBUWg
linkProvider	Geneva Foundation for Medical Education and Research
openUrl	ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&rft.genre=article&rft.atitle=Exposing+Vulnerabilities+in+Clinical+LLMs+Through+Data+Poisoning+Attacks%3A+Case+Study+in+Breast+Cancer&rft.jtitle=AMIA+...+Annual+Symposium+proceedings&rft.au=Das%2C+Avisha&rft.au=Tariq%2C+Amara&rft.au=Batalini%2C+Felipe&rft.au=Dhara%2C+Boddhisattwa&rft.date=2024&rft.eissn=1559-4076&rft.volume=2024&rft.spage=339&rft_id=info%3Apmid%2F40417494&rft.externalDocID=40417494
thumbnail_l	http://covers-cdn.summon.serialssolutions.com/index.aspx?isbn=/lc.gif&issn=1942-597X&client=summon
thumbnail_m	http://covers-cdn.summon.serialssolutions.com/index.aspx?isbn=/mc.gif&issn=1942-597X&client=summon
thumbnail_s	http://covers-cdn.summon.serialssolutions.com/index.aspx?isbn=/sc.gif&issn=1942-597X&client=summon