On the Sensitivity of Reward Inference to Misspecified Human Models
Inferring reward functions from human behavior is at the center of value alignment - aligning AI objectives with what we, humans, actually want. But doing so relies on models of how humans behave given their objectives. After decades of research in cognitive science, neuroscience, and behavioral eco...
Saved in:
Main Authors | , , |
---|---|
Format | Journal Article |
Language | English |
Published |
09.12.2022
|
Subjects | |
Online Access | Get full text |
Cover
Loading…
Summary: | Inferring reward functions from human behavior is at the center of value
alignment - aligning AI objectives with what we, humans, actually want. But
doing so relies on models of how humans behave given their objectives. After
decades of research in cognitive science, neuroscience, and behavioral
economics, obtaining accurate human models remains an open research topic. This
begs the question: how accurate do these models need to be in order for the
reward inference to be accurate? On the one hand, if small errors in the model
can lead to catastrophic error in inference, the entire framework of reward
learning seems ill-fated, as we will never have perfect models of human
behavior. On the other hand, if as our models improve, we can have a guarantee
that reward accuracy also improves, this would show the benefit of more work on
the modeling side. We study this question both theoretically and empirically.
We do show that it is unfortunately possible to construct small adversarial
biases in behavior that lead to arbitrarily large errors in the inferred
reward. However, and arguably more importantly, we are also able to identify
reasonable assumptions under which the reward inference error can be bounded
linearly in the error in the human model. Finally, we verify our theoretical
insights in discrete and continuous control tasks with simulated and human
data. |
---|---|
DOI: | 10.48550/arxiv.2212.04717 |