Comparative Knowledge Distillation
Format | Journal Article |
Language | English |
Published | 03.11.2023 |
Summary: | In the era of large-scale pretrained models, Knowledge Distillation (KD) serves an important role in transferring the wisdom of computationally heavy teacher models to lightweight, efficient student models while preserving performance. Traditional KD paradigms, however, assume readily available access to teacher models for frequent inference -- a notion increasingly at odds with the realities of costly, often proprietary, large-scale models. Addressing this gap, our paper considers how to minimize the dependency on teacher model inferences in KD in a setting we term Few Teacher Inference Knowledge Distillation (FTI KD). We observe that prevalent KD techniques and state-of-the-art data augmentation strategies fall short in this constrained setting. Drawing inspiration from educational principles that emphasize learning through comparison, we propose Comparative Knowledge Distillation (CKD), which encourages student models to understand the nuanced differences in a teacher model's interpretations of samples. Critically, CKD provides additional learning signals to the student without making additional teacher calls. We also extend the principle of CKD to groups of samples, enabling even more efficient learning from limited teacher calls. Empirical evaluation across varied experimental settings indicates that CKD consistently outperforms state-of-the-art data augmentation and KD techniques. |
DOI: | 10.48550/arxiv.2311.02253 |
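To make the comparative signal described in the summary concrete, below is a minimal sketch of one way a pairwise comparative distillation loss could be written, assuming the student is trained to match differences between cached teacher outputs for pairs of samples, with the mismatch penalized by mean squared error. The names (`ckd_loss`, `lambda_ckd`, `cached_teacher_outputs`) are illustrative, not taken from the paper, and the authors' exact formulation may differ.

```python
# Hypothetical sketch of a pairwise comparative distillation loss.
# Names and the exact loss form are assumptions, not the paper's definition.
import torch
import torch.nn.functional as F


def ckd_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    """Match differences between pairs of samples rather than raw outputs.

    student_logits, teacher_logits: (batch, num_classes). Teacher logits are
    assumed to be precomputed and cached, so no new teacher calls are made.
    """
    # Pairwise differences within the batch: shape (batch, batch, num_classes).
    s_diff = student_logits.unsqueeze(1) - student_logits.unsqueeze(0)
    t_diff = teacher_logits.unsqueeze(1) - teacher_logits.unsqueeze(0)
    # Penalize mismatch between student and teacher comparative signals.
    return F.mse_loss(s_diff, t_diff)


# Usage (illustrative): combine with the task loss, reusing cached teacher outputs.
# teacher_logits = cached_teacher_outputs[batch_indices]  # no extra teacher inference
# loss = task_loss + lambda_ckd * ckd_loss(student(x), teacher_logits)
```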