Tell Me What You Don't Know: Enhancing Refusal Capabilities of Role-Playing Agents via Representation Space Analysis and Editing
Main Authors | , , , , , , , , , |
---|---|
Format | Journal Article |
Language | English |
Published | 25.09.2024 |
Summary: | Role-Playing Agents (RPAs) have shown remarkable performance in various
applications, yet they often struggle to recognize and appropriately respond to
hard queries that conflict with their role-play knowledge. To investigate RPAs'
performance when faced with different types of conflicting requests, we develop
an evaluation benchmark that includes contextual knowledge conflicting
requests, parametric knowledge conflicting requests, and non-conflicting
requests to assess RPAs' ability to identify conflicts and refuse to answer
appropriately without over-refusing. Through extensive evaluation, we find that
most RPAs exhibit significant performance gaps across different types of conflicting
requests. To elucidate the reasons, we conduct an in-depth representation-level
analysis of RPAs under various conflict scenarios. Our findings reveal the
existence of rejection regions and direct response regions within the model's
forward-pass representations, and these regions influence the RPA's final response
behavior. Therefore, we introduce a lightweight representation editing approach
that conveniently shifts conflicting requests to the rejection region, thereby
enhancing the model's refusal accuracy. The experimental results validate the
effectiveness of our editing method, improving RPAs' ability to refuse
conflicting requests while maintaining their general role-playing capabilities. |
---|---|
DOI: | 10.48550/arxiv.2409.16913 |
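
The summary describes shifting the representation of a conflicting request toward a rejection region but does not say how. The sketch below illustrates the general idea only, under a loud assumption: it uses a simple difference-of-means steering vector, which may differ from the paper's actual editing method. All function names, the `alpha` scaling parameter, and the toy two-dimensional vectors are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of a non-empty collection of equally sized vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]


def steering_direction(refused: Sequence[Sequence[float]],
                       answered: Sequence[Sequence[float]]) -> Vector:
    """ASSUMPTION: direction from the mean 'direct response' representation
    to the mean 'rejection' representation (difference of means)."""
    mu_refused = mean_vector(refused)
    mu_answered = mean_vector(answered)
    return [r - a for r, a in zip(mu_refused, mu_answered)]


def edit_representation(hidden: Sequence[float],
                        direction: Sequence[float],
                        alpha: float = 1.0) -> Vector:
    """Shift one hidden representation along the steering direction.
    `alpha` is a hypothetical strength parameter."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy stand-ins for hidden states collected from refused and answered requests.
refused_states = [[2.0, 0.0], [4.0, 0.0]]
answered_states = [[0.0, 1.0], [0.0, 3.0]]

direction = steering_direction(refused_states, answered_states)
conflicting_query_state = [0.5, 2.0]
edited_state = edit_representation(conflicting_query_state, direction)

rejection_center = mean_vector(refused_states)
before = squared_distance(conflicting_query_state, rejection_center)
after = squared_distance(edited_state, rejection_center)
```

In this toy setup the edited state ends up closer to the mean of the refused examples than the original state was. In a real model the vectors would be hidden states taken from a chosen layer, and the choice of layer and of `alpha` would need to be tuned so that non-conflicting requests are not pushed into refusal.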