Adaptive Knowledge Distillation Based on Entropy

Bibliographic Details
Published in: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7409 - 7413
Main Authors: Kwon, Kisoo; Na, Hwidong; Lee, Hoshik; Kim, Nam Soo
Format: Conference Proceeding
Language: English
Published: IEEE, 01.05.2020

More Information
Summary: The knowledge distillation (KD) approach is widely used in the deep learning field, mainly for model size reduction. KD utilizes the soft labels of a teacher model, which contain the dark knowledge that one-hot ground-truth labels do not have. This knowledge can improve the performance of an already saturated student model. In the case of multiple teacher models, an equally weighted average of the teachers' labels (interpolated training) is generally applied to KD training. However, if the knowledge characteristics differ somewhat among the teachers, interpolated training risks crushing the individual knowledge characteristics and can also introduce a noise component. In this paper, we propose an entropy-based KD training that utilizes the labels of teacher models with lower entropy at a larger rate among the various teacher models. The proposed method shows better performance than the conventional KD training scheme in automatic speech recognition.
ISSN: 2379-190X
DOI: 10.1109/ICASSP40776.2020.9054698
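
The entropy-based weighting described in the summary can be illustrated with a minimal Python sketch. The snippet below is an assumption-laden illustration, not the paper's formulation: the function names and the softmax-over-negative-entropy weighting rule are hypothetical choices used only to show how teachers with lower-entropy (more confident) soft labels can be given larger weights than a plain average would.

    import numpy as np

    def entropy(p, eps=1e-12):
        # Shannon entropy of each probability distribution (last axis = classes).
        return -np.sum(p * np.log(p + eps), axis=-1)

    def entropy_weighted_soft_labels(teacher_probs, temperature=1.0):
        # teacher_probs: array of shape (num_teachers, num_classes), each row a
        # teacher's softmax output for one example.
        # Illustrative weighting (assumption): softmax over negative entropies,
        # so lower-entropy teachers receive larger weights.
        ents = entropy(teacher_probs)              # (num_teachers,)
        logits = -ents / temperature
        weights = np.exp(logits - logits.max())
        weights /= weights.sum()
        return weights @ teacher_probs             # combined soft label, (num_classes,)

    # Example: one confident teacher and one nearly uniform teacher.
    t1 = np.array([0.85, 0.10, 0.05])   # low entropy
    t2 = np.array([0.40, 0.35, 0.25])   # high entropy
    soft_label = entropy_weighted_soft_labels(np.stack([t1, t2]))
    print(soft_label)  # lies closer to t1 than a 50/50 interpolated average would

The resulting soft label would then be used as the KD target for the student; with equal-entropy teachers the scheme reduces to the conventional interpolated average.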