On the Algorithmic Bias of Aligning Large Language Models with RLHF: Preference Collapse and Matching Regularization
Main Authors | |
---|---|
Format | Journal Article |
Language | English |
Published | 26.05.2024 |
Summary: | Accurately aligning large language models (LLMs) with human preferences is crucial for informing fair, economically sound, and statistically efficient decision-making processes. However, we argue that reinforcement learning from human feedback (RLHF) -- the predominant approach for aligning LLMs with human preferences through a reward model -- suffers from an inherent algorithmic bias due to its Kullback--Leibler-based regularization in optimization. In extreme cases, this bias could lead to a phenomenon we term preference collapse, where minority preferences are virtually disregarded. To mitigate this algorithmic bias, we introduce preference matching (PM) RLHF, a novel approach that provably aligns LLMs with the preference distribution of the reward model under the Bradley--Terry--Luce/Plackett--Luce model. Central to our approach is a PM regularizer that takes the form of the negative logarithm of the LLM's policy probability distribution over responses, which helps the LLM balance response diversification and reward maximization. Notably, we obtain this regularizer by solving an ordinary differential equation that is necessary for the PM property. For practical implementation, we introduce a conditional variant of PM RLHF that is tailored to natural language generation. Finally, we empirically validate the effectiveness of conditional PM RLHF through experiments on the OPT-1.3B and Llama-2-7B models, demonstrating a 29% to 41% improvement in alignment with human preferences, as measured by a certain metric, compared to standard RLHF. |
DOI: | 10.48550/arxiv.2405.16455 |
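
The summary above contrasts two regularized alignment objectives. The following LaTeX sketch writes them out as a reading aid, reconstructed from the summary alone rather than from the paper's exact formulation: here r denotes the reward model, \pi the LLM policy, \pi_ref the reference policy, and \beta a regularization weight; scaling constants and the conditional variant used in the paper may differ.

```latex
% Standard RLHF: reward maximization with a Kullback--Leibler penalty toward
% a reference policy (the regularization the summary identifies as the source
% of algorithmic bias).
\[
  \max_{\pi}\;
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}
  \bigl[\, r(x, y) \,\bigr]
  \;-\;
  \beta\, \mathbb{E}_{x \sim \mathcal{D}}\,
  \mathrm{KL}\!\bigl(\pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\bigr).
\]

% Preference matching (PM) RLHF, as described in the summary: the KL term is
% replaced by a regularizer of the form -log pi(y | x), so in expectation the
% objective adds the policy's entropy and rewards response diversification.
\[
  \max_{\pi}\;
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}
  \bigl[\, r(x, y) \;-\; \log \pi(y \mid x) \,\bigr].
\]

% Under the Bradley--Terry--Luce/Plackett--Luce model, the reward model induces
% a softmax preference distribution over responses; a policy matches preferences
% when it reproduces that distribution:
\[
  \pi^{\star}(y \mid x)
  \;=\;
  \frac{\exp\bigl(r(x, y)\bigr)}{\sum_{y'} \exp\bigl(r(x, y')\bigr)}.
\]
```

The entropy-like term in the second display is what the summary calls the PM regularizer; its maximizer is exactly the softmax distribution in the third display, which is why minority preferences keep their proportional probability mass instead of collapsing.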