Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents

For safety reasons, large language models (LLMs) are trained to refuse harmful user instructions, such as assisting with dangerous activities. In this work, we study an open question: does the desired safety refusal, typically enforced in chat contexts, generalize to non-chat and agentic use cases? Unlike...

Bibliographic Details
Main Authors: Kumar, Priyanshu; Lau, Elaine; Vijayakumar, Saranya; Trinh, Tu; Scale Red Team; Chang, Elaine; Robinson, Vaughn; Hendryx, Sean; Zhou, Shuyan; Fredrikson, Matt; Yue, Summer; Wang, Zifan
Format: Journal Article
Language: English
Published: 11.10.2024
