Knowledge Distillation with Distribution Mismatch

Bibliographic Details
Published in: Machine Learning and Knowledge Discovery in Databases. Research Track, Vol. 12976, pp. 250-265
Main Authors: Nguyen, Dang; Gupta, Sunil; Nguyen, Trong; Rana, Santu; Nguyen, Phuoc; Tran, Truyen; Le, Ky; Ryan, Shannon; Venkatesh, Svetha
Format: Book Chapter
Language: English
Published: Switzerland: Springer International Publishing AG, 2021
Series: Lecture Notes in Computer Science

Summary: Knowledge distillation (KD) is one of the most efficient methods to compress a large deep neural network (called the teacher) into a smaller network (called the student). Current state-of-the-art KD methods assume that the distributions of the teacher's and the student's training data are identical, so that the student's accuracy stays close to the teacher's. However, this strong assumption is not met in many real-world applications where a distribution mismatch exists between the teacher's training data and the student's training data, and existing KD methods often fail in this case. To overcome this problem, we propose a novel method for the KD process that remains effective when such a mismatch occurs. We first learn a distribution based on the student's training data, from which we can sample images well-classified by the teacher. By doing this, we discover the data space where the teacher has good knowledge to transfer to the student. We then propose a new loss function to train the student network, which achieves better accuracy than the standard KD loss function. We conduct extensive experiments to demonstrate that our method works well for KD tasks with or without distribution mismatch. To the best of our knowledge, ours is the first method to address the challenge of distribution mismatch when performing the KD process.
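
For reference, the summary compares against the standard KD loss. The sketch below shows a common PyTorch formulation of that baseline loss, plus a simple teacher-confidence filter as a rough illustration of keeping "images well-classified by the teacher". This is not the chapter's method; the hyper-parameters `temperature`, `alpha`, and `threshold` are conventional names assumed here for illustration.

```python
import torch
import torch.nn.functional as F

def standard_kd_loss(student_logits, teacher_logits, labels,
                     temperature=4.0, alpha=0.5):
    """Standard KD baseline: soft-target KL divergence plus hard-label cross-entropy."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=1)
    # Scale the KL term by T^2, as in standard KD, so its gradient magnitude
    # stays comparable to the cross-entropy term.
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

def teacher_confident_mask(teacher_logits, threshold=0.9):
    """Keep only samples the teacher classifies with high confidence -- a crude
    stand-in for restricting distillation to data the teacher knows well."""
    probs = F.softmax(teacher_logits, dim=1)
    return probs.max(dim=1).values >= threshold
```

In this sketch, distillation under mismatch would apply `standard_kd_loss` only to the rows selected by `teacher_confident_mask`; the chapter's actual approach learns a distribution over the student's data to sample such images rather than thresholding confidence.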
ISBN: 3030865193, 9783030865191
ISSN: 0302-9743, 1611-3349
DOI: 10.1007/978-3-030-86520-7_16