RoVRM: A Robust Visual Reward Model Optimized via Auxiliary Textual Preference Data
Main Authors | , , , , , , , , , , , |
---|---|
Format | Journal Article |
Language | English |
Published | 21.08.2024 |
Subjects | |
Summary: | Large vision-language models (LVLMs) often fail to align with human
preferences, leading to issues like generating misleading content without
proper visual context (also known as hallucination). A promising solution to
this problem is using human-preference alignment techniques, such as best-of-n
sampling and reinforcement learning. However, these techniques are hampered by
the scarcity of visual preference data, which is required to train a visual
reward model (VRM). In this work, we continue this line of research. We
present a Robust Visual Reward Model (RoVRM), which
improves human-preference alignment for LVLMs. RoVRM leverages auxiliary
textual preference data through a three-phase progressive training and optimal
transport-based preference data selection to effectively mitigate the scarcity
of visual preference data. We experiment with RoVRM on the commonly used
vision-language tasks based on the LLaVA-1.5-7B and -13B models. Experimental
results demonstrate that RoVRM consistently outperforms traditional VRMs.
Furthermore, our three-phase progressive training and preference data selection
approaches can yield consistent performance gains over ranking-based alignment
techniques, such as direct preference optimization. |
---|---|
DOI: | 10.48550/arxiv.2408.12109 |