Unelicitable Backdoors in Language Models via Cryptographic Transformer Circuits

The rapid proliferation of open-source language models significantly increases the risks of downstream backdoor attacks. These backdoors can introduce dangerous behaviours during model deployment and can evade detection by conventional cybersecurity monitoring systems. In this paper, we introduce a...

Bibliographic Details
Main Authors	Draguns, Andis; Gritsevskiy, Andrew; Motwani, Sumeet Ramesh; Rogers-Smith, Charlie; Ladish, Jeffrey; de Witt, Christian Schroeder
Format	Journal Article
Language	English
Published	03.06.2024
Subjects	Computer Science - Cryptography and Security; Computer Science - Learning
