Unelicitable Backdoors in Language Models via Cryptographic Transformer Circuits

The rapid proliferation of open-source language models significantly increases the risks of downstream backdoor attacks. These backdoors can introduce dangerous behaviours during model deployment and can evade detection by conventional cybersecurity monitoring systems. In this paper, we introduce a...

Full description

Saved in:

Bibliographic Details
Main Authors	Draguns, Andis, Gritsevskiy, Andrew, Motwani, Sumeet Ramesh, Rogers-Smith, Charlie, Ladish, Jeffrey, de Witt, Christian Schroeder
Format	Journal Article
Language	English
Published	03.06.2024
Subjects	Computer Science - Cryptography and Security Computer Science - Learning
Online Access	Get full text

Cover

Loading…

Abstract	The rapid proliferation of open-source language models significantly increases the risks of downstream backdoor attacks. These backdoors can introduce dangerous behaviours during model deployment and can evade detection by conventional cybersecurity monitoring systems. In this paper, we introduce a novel class of backdoors in autoregressive transformer models, that, in contrast to prior art, are unelicitable in nature. Unelicitability prevents the defender from triggering the backdoor, making it impossible to evaluate or detect ahead of deployment even if given full white-box access and using automated techniques, such as red-teaming or certain formal verification methods. We show that our novel construction is not only unelicitable thanks to using cryptographic techniques, but also has favourable robustness properties. We confirm these properties in empirical investigations, and provide evidence that our backdoors can withstand state-of-the-art mitigation strategies. Additionally, we expand on previous work by showing that our universal backdoors, while not completely undetectable in white-box settings, can be harder to detect than some existing designs. By demonstrating the feasibility of seamlessly integrating backdoors into transformer models, this paper fundamentally questions the efficacy of pre-deployment detection strategies. This offers new insights into the offence-defence balance in AI safety and security.
AbstractList	The rapid proliferation of open-source language models significantly increases the risks of downstream backdoor attacks. These backdoors can introduce dangerous behaviours during model deployment and can evade detection by conventional cybersecurity monitoring systems. In this paper, we introduce a novel class of backdoors in autoregressive transformer models, that, in contrast to prior art, are unelicitable in nature. Unelicitability prevents the defender from triggering the backdoor, making it impossible to evaluate or detect ahead of deployment even if given full white-box access and using automated techniques, such as red-teaming or certain formal verification methods. We show that our novel construction is not only unelicitable thanks to using cryptographic techniques, but also has favourable robustness properties. We confirm these properties in empirical investigations, and provide evidence that our backdoors can withstand state-of-the-art mitigation strategies. Additionally, we expand on previous work by showing that our universal backdoors, while not completely undetectable in white-box settings, can be harder to detect than some existing designs. By demonstrating the feasibility of seamlessly integrating backdoors into transformer models, this paper fundamentally questions the efficacy of pre-deployment detection strategies. This offers new insights into the offence-defence balance in AI safety and security.
Author	Motwani, Sumeet Ramesh de Witt, Christian Schroeder Gritsevskiy, Andrew Rogers-Smith, Charlie Draguns, Andis Ladish, Jeffrey
Author_xml	– sequence: 1 givenname: Andis surname: Draguns fullname: Draguns, Andis – sequence: 2 givenname: Andrew surname: Gritsevskiy fullname: Gritsevskiy, Andrew – sequence: 3 givenname: Sumeet Ramesh surname: Motwani fullname: Motwani, Sumeet Ramesh – sequence: 4 givenname: Charlie surname: Rogers-Smith fullname: Rogers-Smith, Charlie – sequence: 5 givenname: Jeffrey surname: Ladish fullname: Ladish, Jeffrey – sequence: 6 givenname: Christian Schroeder surname: de Witt fullname: de Witt, Christian Schroeder
BackLink	https://doi.org/10.48550/arXiv.2406.02619$$DView paper in arXiv
BookMark	eNqFjrsOgjAUQDvo4OsDnOwPiAWB6CrROGjigDO5loI3lpbcApG_NxJ3p7Oc5JwpGxlrFGNLX3jhLorEBuiNnReEIvZEEPv7CbvdjdIosYGHVvwA8pVbS46j4RcwZQul4lebK-14h8AT6uvGlgT1EyVPCYwrLFWKeIIkW2zcnI0L0E4tfpyx1emYJuf10M5qwgqoz74P2fCw_W98AIkKPtw
ContentType	Journal Article
Copyright	http://arxiv.org/licenses/nonexclusive-distrib/1.0
Copyright_xml	– notice: http://arxiv.org/licenses/nonexclusive-distrib/1.0
DBID	AKY GOX
DOI	10.48550/arxiv.2406.02619
DatabaseName	arXiv Computer Science arXiv.org
DatabaseTitleList
Database_xml	– sequence: 1 dbid: GOX name: arXiv.org url: http://arxiv.org/find sourceTypes: Open Access Repository
DeliveryMethod	fulltext_linktorsrc
ExternalDocumentID	2406_02619
GroupedDBID	AKY GOX
ID	FETCH-arxiv_primary_2406_026193
IEDL.DBID	GOX
IngestDate	Tue Jun 18 04:50:32 EDT 2024
IsDoiOpenAccess	true
IsOpenAccess	true
IsPeerReviewed	false
IsScholarly	false
Language	English
LinkModel	DirectLink
MergedId	FETCHMERGED-arxiv_primary_2406_026193
OpenAccessLink	https://arxiv.org/abs/2406.02619
ParticipantIDs	arxiv_primary_2406_02619
PublicationCentury	2000
PublicationDate	2024-06-03
PublicationDateYYYYMMDD	2024-06-03
PublicationDate_xml	– month: 06 year: 2024 text: 2024-06-03 day: 03
PublicationDecade	2020
PublicationYear	2024
Score	3.8558936
SecondaryResourceType	preprint
Snippet	The rapid proliferation of open-source language models significantly increases the risks of downstream backdoor attacks. These backdoors can introduce...
SourceID	arxiv
SourceType	Open Access Repository
SubjectTerms	Computer Science - Cryptography and Security Computer Science - Learning
Title	Unelicitable Backdoors in Language Models via Cryptographic Transformer Circuits
URI	https://arxiv.org/abs/2406.02619
hasFullText	1
inHoldings	1
isFullTextHit
isPrint
link	http://utb.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMwdV27TsMwFL1qO7EgEKDy9sAayMNu8AgRpUK8hlbKFjmOLVlEaeWkFfw9fqSCpat9ZV3ZwznHvvcY4CaVlMiQVgHlnAY45WVAZcIDXBFJcBQx3_X-9j6ZLfBLTvIBoG0vDNPfauP9gcv2zsLNrVMJQxjGsS3Zev7I_eOks-Lq4__iDMd0Q_9AYnoA-z27Qw_-OA5hIJoj-Fw0olbciPCyFuiR8a9qudQtUg167W8Lkf2SrG7RRjGU6Z9V542kFUfzLbEUGmVK87Xq2mO4nj7Ns1ngcihW3jCisOkVLr3kBEZG1ouxrSlK2YRHhgDIClNGynuKiZG2IWM0oiw8hfGuVc52T53DXmxg1xUzJRcw6vRaXBrY7Mort3e_2c9ylA
link.rule.ids	228,230,786,891
linkProvider	Cornell University
openUrl	ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&rft.genre=article&rft.atitle=Unelicitable+Backdoors+in+Language+Models+via+Cryptographic+Transformer+Circuits&rft.au=Draguns%2C+Andis&rft.au=Gritsevskiy%2C+Andrew&rft.au=Motwani%2C+Sumeet+Ramesh&rft.au=Rogers-Smith%2C+Charlie&rft.date=2024-06-03&rft_id=info:doi/10.48550%2Farxiv.2406.02619&rft.externalDocID=2406_02619