Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents
For safety reasons, large language models (LLMs) are trained to refuse harmful user instructions, such as assisting with dangerous activities. In this work, we study an open question: does the desired safety refusal, typically enforced in chat contexts, generalize to non-chat and agentic use cases? Unlike...
| Main Authors | |
|---|---|
| Format | Journal Article |
| Language | English |
| Published | 11.10.2024 |
| Subjects | |