A Theoretical Understanding of Self-Correction through In-context Alignment
Format: Journal Article
Language: English
Published: 28.05.2024
Summary: Going beyond mimicking limited human experiences, recent studies show initial evidence that, like humans, large language models (LLMs) are capable of improving their abilities purely by self-correction, i.e., correcting previous responses through self-examination, in certain circumstances. Nevertheless, little is known about how such capabilities arise. In this work, based on a simplified setup akin to an alignment task, we theoretically analyze self-correction from an in-context learning perspective, showing that when LLMs give relatively accurate self-examinations as rewards, they are capable of refining responses in an in-context way. Notably, going beyond previous theories on over-simplified linear transformers, our theoretical construction underpins the roles of several key designs of realistic transformers for self-correction: softmax attention, multi-head attention, and the MLP block. We validate these findings extensively on synthetic datasets. Inspired by these findings, we also illustrate novel applications of self-correction, such as defending against LLM jailbreaks, where a simple self-correction step does make a large difference. We believe that these findings will inspire further research on understanding, exploiting, and enhancing self-correction for building better foundation models.
DOI: 10.48550/arxiv.2405.18634
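
To make the mechanism in the abstract concrete, the following is a minimal Python sketch of such a self-correction loop: the model critiques its own previous answer (a self-generated reward) and regenerates with earlier attempts kept in context. The `generate` placeholder, the prompt wording, and the round count are illustrative assumptions, not the paper's theoretical construction.

```python
# Minimal sketch of in-context self-correction, assuming a hypothetical
# `generate` wrapper around an LLM call. Prompts, round count, and
# function names are illustrative assumptions, not the paper's method.

def generate(prompt: str) -> str:
    """Placeholder for an LLM call; swap in a real model or API client."""
    raise NotImplementedError

def self_correct(question: str, max_rounds: int = 3) -> str:
    """Refine a response using the model's own critique as a reward signal."""
    response = generate(question)
    attempts = [response]
    for _ in range(max_rounds):
        # Self-examination: the model scores/critiques its previous answer.
        critique = generate(
            f"Question: {question}\nAnswer: {response}\n"
            "Critique this answer and rate it from 1 (poor) to 10 (excellent)."
        )
        # Refinement: earlier attempts stay in the prompt, so each
        # (response, self-reward) pair conditions the next generation,
        # mirroring the in-context-alignment view of self-correction.
        context = "\n".join(f"Previous attempt: {a}" for a in attempts)
        response = generate(
            f"Question: {question}\n{context}\nCritique: {critique}\n"
            "Write an improved answer."
        )
        attempts.append(response)
    return response
```

Consistent with the jailbreak-defense application the abstract mentions, even a single round of this loop (`max_rounds=1`) with a safety-oriented critique prompt would correspond to the "simple self-correction step" described there.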