Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer

Aligning generative models with human preference via RLHF typically suffers from overoptimization, where an imperfectly learned reward model can misguide the generative model to output undesired responses. We investigate this problem in a principled manner by identifying the source of the misalignment ...
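As a rough illustration of the kind of objective the title alludes to, the display below combines a reward-maximization term with a supervised fine-tuning (SFT) log-likelihood term. This is a minimal sketch, not the paper's stated method; the symbols $\hat{r}$ (learned reward model), $\pi_\theta$ (generative policy), $y^\star$ (demonstration response), and the weight $\beta$ are illustrative assumptions rather than notation taken from the paper:

\[
\max_{\theta}\;
\mathbb{E}_{x\sim\mathcal{D},\, y\sim\pi_\theta(\cdot\mid x)}\!\left[\hat{r}(x,y)\right]
\;+\;
\beta\,\mathbb{E}_{(x,y^\star)\sim\mathcal{D}}\!\left[\log\pi_\theta(y^\star\mid x)\right]
\]

Under this reading, the second term is the negative of the usual SFT loss; keeping it in the objective penalizes policies that drift away from the demonstration data, which is the regularization effect the abstract associates with mitigating overoptimization.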

Bibliographic Details
Main Authors: Liu, Zhihan; Lu, Miao; Zhang, Shenao; Liu, Boyi; Guo, Hongyi; Yang, Yingxiang; Blanchet, Jose; Wang, Zhaoran
Format: Journal Article
Language: English
Published: 26.05.2024
