Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents

For safety reasons, large language models (LLMs) are trained to refuse harmful user instructions, such as assisting with dangerous activities. In this work, we study an open question: does the desired safety refusal, typically enforced in chat contexts, generalize to non-chat and agentic use cases? Unlike...

Bibliographic Details
Main Authors: Kumar, Priyanshu; Lau, Elaine; Vijayakumar, Saranya; Trinh, Tu; Scale Red Team; Chang, Elaine; Robinson, Vaughn; Hendryx, Sean; Zhou, Shuyan; Fredrikson, Matt; Yue, Summer; Wang, Zifan
Format: Journal Article
Language: English
Published: 11.10.2024
