Co-Supervised Learning: Improving Weak-to-Strong Generalization with Hierarchical Mixture of Experts
| Main Authors | |
| --- | --- |
| Format | Journal Article |
| Language | English |
| Published | 23.02.2024 |
| Subjects | |
Summary: Steering the behavior of a strong model pre-trained on internet-scale data can be difficult due to the scarcity of competent supervisors. Recent studies reveal that, despite supervisory noise, a strong student model may surpass its weak teacher when fine-tuned on specific objectives. Yet, the effectiveness of such weak-to-strong generalization remains limited, especially in the presence of large capability gaps. In this paper, we propose to address this challenge by harnessing a diverse set of specialized teachers, instead of a single generalist one, that collectively supervise the strong student. Our approach resembles the classical hierarchical mixture of experts, with two components tailored for co-supervision: (i) we progressively alternate student training and teacher assignment, leveraging the growth of the strong student to identify plausible supervision; (ii) we conservatively enforce teacher-student and local-global consistency, leveraging their dependencies to reject potential annotation noise. We validate the proposed method through visual recognition tasks on the OpenAI weak-to-strong benchmark and additional multi-domain datasets. Our code is available at \url{https://github.com/yuejiangliu/csl}.
DOI: 10.48550/arxiv.2402.15505
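The summary describes the two co-supervision components only at a high level. The following is a minimal toy sketch of how such an alternating loop could look, assuming linear probes stand in for both the specialized weak teachers and the strong student; every name here (`assign_teachers`, `co_supervised_round`, the gating and consistency rules) is illustrative and is not taken from the paper or the linked repository.

```python
# Hypothetical sketch of co-supervised learning with multiple weak teachers.
# Not the authors' implementation; a toy illustration of the abstract's two steps.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def assign_teachers(student_w, teachers, x):
    # Gate each sample to the teacher whose soft prediction best supports the
    # student's current decision (a crude stand-in for an HME gating network).
    student_label = (x @ student_w > 0).astype(float)           # (n,)
    probs = np.stack([sigmoid(x @ t) for t in teachers])        # (k, n)
    support = np.where(student_label > 0.5, probs, 1.0 - probs)
    return support.argmax(axis=0)                               # teacher index per sample

def co_supervised_round(student_w, teachers, x, lr=0.5):
    # One alternation: (i) assign teachers, (ii) keep only samples where the
    # assigned teacher agrees with both the student (teacher-student consistency)
    # and the majority vote of all teachers (local-global consistency),
    # (iii) update the student on the retained pseudo-labels.
    assignment = assign_teachers(student_w, teachers, x)
    teacher_labels = np.stack([(x @ t > 0).astype(float) for t in teachers])  # (k, n)
    pseudo = teacher_labels[assignment, np.arange(len(x))]
    majority = (teacher_labels.mean(axis=0) > 0.5).astype(float)
    student_label = (x @ student_w > 0).astype(float)
    keep = (pseudo == student_label) & (pseudo == majority)
    if keep.any():
        # Logistic-regression step on the retained pseudo-labels.
        p = sigmoid(x[keep] @ student_w)
        student_w = student_w - lr * x[keep].T @ (p - pseudo[keep]) / keep.sum()
    return student_w

x = rng.normal(size=(256, 8))                      # unlabeled pool seen by the student
teachers = [rng.normal(size=8) for _ in range(3)]  # fixed, specialized weak teachers
student_w = rng.normal(size=8)                     # "strong" student, here just a linear probe
for _ in range(10):                                # progressively alternate assignment and training
    student_w = co_supervised_round(student_w, teachers, x)
```

In this sketch the conservative filter simply discards samples whose pseudo-label conflicts with either the student or the teacher majority; the paper's actual consistency criteria and model architectures are not specified in the record above.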