Unelicitable Backdoors in Language Models via Cryptographic Transformer Circuits
The rapid proliferation of open-source language models significantly increases the risks of downstream backdoor attacks. These backdoors can introduce dangerous behaviours during model deployment and can evade detection by conventional cybersecurity monitoring systems. In this paper, we introduce a...
Saved in:
Main Authors | , , , , , |
---|---|
Format | Journal Article |
Language | English |
Published |
03.06.2024
|
Subjects | |
Online Access | Get full text |
Cover
Loading…
Abstract | The rapid proliferation of open-source language models significantly
increases the risks of downstream backdoor attacks. These backdoors can
introduce dangerous behaviours during model deployment and can evade detection
by conventional cybersecurity monitoring systems. In this paper, we introduce a
novel class of backdoors in autoregressive transformer models, that, in
contrast to prior art, are unelicitable in nature. Unelicitability prevents the
defender from triggering the backdoor, making it impossible to evaluate or
detect ahead of deployment even if given full white-box access and using
automated techniques, such as red-teaming or certain formal verification
methods. We show that our novel construction is not only unelicitable thanks to
using cryptographic techniques, but also has favourable robustness properties.
We confirm these properties in empirical investigations, and provide evidence
that our backdoors can withstand state-of-the-art mitigation strategies.
Additionally, we expand on previous work by showing that our universal
backdoors, while not completely undetectable in white-box settings, can be
harder to detect than some existing designs. By demonstrating the feasibility
of seamlessly integrating backdoors into transformer models, this paper
fundamentally questions the efficacy of pre-deployment detection strategies.
This offers new insights into the offence-defence balance in AI safety and
security. |
---|---|
AbstractList | The rapid proliferation of open-source language models significantly
increases the risks of downstream backdoor attacks. These backdoors can
introduce dangerous behaviours during model deployment and can evade detection
by conventional cybersecurity monitoring systems. In this paper, we introduce a
novel class of backdoors in autoregressive transformer models, that, in
contrast to prior art, are unelicitable in nature. Unelicitability prevents the
defender from triggering the backdoor, making it impossible to evaluate or
detect ahead of deployment even if given full white-box access and using
automated techniques, such as red-teaming or certain formal verification
methods. We show that our novel construction is not only unelicitable thanks to
using cryptographic techniques, but also has favourable robustness properties.
We confirm these properties in empirical investigations, and provide evidence
that our backdoors can withstand state-of-the-art mitigation strategies.
Additionally, we expand on previous work by showing that our universal
backdoors, while not completely undetectable in white-box settings, can be
harder to detect than some existing designs. By demonstrating the feasibility
of seamlessly integrating backdoors into transformer models, this paper
fundamentally questions the efficacy of pre-deployment detection strategies.
This offers new insights into the offence-defence balance in AI safety and
security. |
Author | Motwani, Sumeet Ramesh de Witt, Christian Schroeder Gritsevskiy, Andrew Rogers-Smith, Charlie Draguns, Andis Ladish, Jeffrey |
Author_xml | – sequence: 1 givenname: Andis surname: Draguns fullname: Draguns, Andis – sequence: 2 givenname: Andrew surname: Gritsevskiy fullname: Gritsevskiy, Andrew – sequence: 3 givenname: Sumeet Ramesh surname: Motwani fullname: Motwani, Sumeet Ramesh – sequence: 4 givenname: Charlie surname: Rogers-Smith fullname: Rogers-Smith, Charlie – sequence: 5 givenname: Jeffrey surname: Ladish fullname: Ladish, Jeffrey – sequence: 6 givenname: Christian Schroeder surname: de Witt fullname: de Witt, Christian Schroeder |
BackLink | https://doi.org/10.48550/arXiv.2406.02619$$DView paper in arXiv |
BookMark | eNqFjrsOgjAUQDvo4OsDnOwPiAWB6CrROGjigDO5loI3lpbcApG_NxJ3p7Oc5JwpGxlrFGNLX3jhLorEBuiNnReEIvZEEPv7CbvdjdIosYGHVvwA8pVbS46j4RcwZQul4lebK-14h8AT6uvGlgT1EyVPCYwrLFWKeIIkW2zcnI0L0E4tfpyx1emYJuf10M5qwgqoz74P2fCw_W98AIkKPtw |
ContentType | Journal Article |
Copyright | http://arxiv.org/licenses/nonexclusive-distrib/1.0 |
Copyright_xml | – notice: http://arxiv.org/licenses/nonexclusive-distrib/1.0 |
DBID | AKY GOX |
DOI | 10.48550/arxiv.2406.02619 |
DatabaseName | arXiv Computer Science arXiv.org |
DatabaseTitleList | |
Database_xml | – sequence: 1 dbid: GOX name: arXiv.org url: http://arxiv.org/find sourceTypes: Open Access Repository |
DeliveryMethod | fulltext_linktorsrc |
ExternalDocumentID | 2406_02619 |
GroupedDBID | AKY GOX |
ID | FETCH-arxiv_primary_2406_026193 |
IEDL.DBID | GOX |
IngestDate | Tue Jun 18 04:50:32 EDT 2024 |
IsDoiOpenAccess | true |
IsOpenAccess | true |
IsPeerReviewed | false |
IsScholarly | false |
Language | English |
LinkModel | DirectLink |
MergedId | FETCHMERGED-arxiv_primary_2406_026193 |
OpenAccessLink | https://arxiv.org/abs/2406.02619 |
ParticipantIDs | arxiv_primary_2406_02619 |
PublicationCentury | 2000 |
PublicationDate | 2024-06-03 |
PublicationDateYYYYMMDD | 2024-06-03 |
PublicationDate_xml | – month: 06 year: 2024 text: 2024-06-03 day: 03 |
PublicationDecade | 2020 |
PublicationYear | 2024 |
Score | 3.8558936 |
SecondaryResourceType | preprint |
Snippet | The rapid proliferation of open-source language models significantly
increases the risks of downstream backdoor attacks. These backdoors can
introduce... |
SourceID | arxiv |
SourceType | Open Access Repository |
SubjectTerms | Computer Science - Cryptography and Security Computer Science - Learning |
Title | Unelicitable Backdoors in Language Models via Cryptographic Transformer Circuits |
URI | https://arxiv.org/abs/2406.02619 |
hasFullText | 1 |
inHoldings | 1 |
isFullTextHit | |
isPrint | |
link | http://utb.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMwdV27TsMwFL1qO7EgEKDy9sAayMNu8AgRpUK8hlbKFjmOLVlEaeWkFfw9fqSCpat9ZV3ZwznHvvcY4CaVlMiQVgHlnAY45WVAZcIDXBFJcBQx3_X-9j6ZLfBLTvIBoG0vDNPfauP9gcv2zsLNrVMJQxjGsS3Zev7I_eOks-Lq4__iDMd0Q_9AYnoA-z27Qw_-OA5hIJoj-Fw0olbciPCyFuiR8a9qudQtUg167W8Lkf2SrG7RRjGU6Z9V542kFUfzLbEUGmVK87Xq2mO4nj7Ns1ngcihW3jCisOkVLr3kBEZG1ouxrSlK2YRHhgDIClNGynuKiZG2IWM0oiw8hfGuVc52T53DXmxg1xUzJRcw6vRaXBrY7Mort3e_2c9ylA |
link.rule.ids | 228,230,786,891 |
linkProvider | Cornell University |
openUrl | ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&rft.genre=article&rft.atitle=Unelicitable+Backdoors+in+Language+Models+via+Cryptographic+Transformer+Circuits&rft.au=Draguns%2C+Andis&rft.au=Gritsevskiy%2C+Andrew&rft.au=Motwani%2C+Sumeet+Ramesh&rft.au=Rogers-Smith%2C+Charlie&rft.date=2024-06-03&rft_id=info:doi/10.48550%2Farxiv.2406.02619&rft.externalDocID=2406_02619 |