Unelicitable Backdoors in Language Models via Cryptographic Transformer Circuits

The rapid proliferation of open-source language models significantly increases the risks of downstream backdoor attacks. These backdoors can introduce dangerous behaviours during model deployment and can evade detection by conventional cybersecurity monitoring systems. In this paper, we introduce a...

Bibliographic Details
Main Authors Draguns, Andis; Gritsevskiy, Andrew; Motwani, Sumeet Ramesh; Rogers-Smith, Charlie; Ladish, Jeffrey; de Witt, Christian Schroeder
Format Journal Article
Language English
Published 03.06.2024