Don't Listen To Me: Understanding and Exploring Jailbreak Prompts of Large Language Models
Format: Journal Article
Language: English
Published: 25.03.2024
Summary: Recent advancements in generative AI have enabled ubiquitous access to large language models (LLMs). Empowered by their exceptional capabilities to understand and generate human-like text, these models are being increasingly integrated into our society. At the same time, there are concerns about the potential misuse of this powerful technology, prompting defensive measures from service providers. To overcome such protection, jailbreak prompts have recently emerged as one of the most effective mechanisms to circumvent security restrictions and elicit harmful content that the models were originally designed to prohibit. Due to the rapid development of LLMs and their ease of access via natural language, the frontline of jailbreak prompts is largely seen on online forums and among hobbyists. To gain a better understanding of the threat landscape of semantically meaningful jailbreak prompts, we systematized existing prompts and measured their jailbreak effectiveness empirically. Further, we conducted a user study involving 92 participants with diverse backgrounds to unveil the process of manually creating jailbreak prompts. We observed that users often succeeded in generating jailbreak prompts regardless of their expertise in LLMs. Building on the insights from the user study, we also developed a system that uses AI as an assistant to automate the process of jailbreak prompt generation.
DOI: 10.48550/arxiv.2403.17336