Knowledge Distillation with Distribution Mismatch

Bibliographic Details
Published in: Machine Learning and Knowledge Discovery in Databases. Research Track, Vol. 12976, pp. 250-265
Main Authors: Nguyen, Dang; Gupta, Sunil; Nguyen, Trong; Rana, Santu; Nguyen, Phuoc; Tran, Truyen; Le, Ky; Ryan, Shannon; Venkatesh, Svetha
Format: Book Chapter
Language: English
Published: Switzerland: Springer International Publishing AG, 2021
Series: Lecture Notes in Computer Science

Summary: Knowledge distillation (KD) is one of the most efficient methods to compress a large deep neural network (called the teacher) into a smaller network (called the student). Current state-of-the-art KD methods assume that the distributions of the teacher's and the student's training data are identical, so that the student's accuracy stays close to the teacher's. However, this strong assumption is not met in many real-world applications where a distribution mismatch exists between the teacher's training data and the student's training data, and existing KD methods often fail in this case. To overcome this problem, we propose a novel method for the KD process that remains effective when such a mismatch occurs. We first learn a distribution based on the student's training data, from which we can sample images well-classified by the teacher. By doing this, we discover the data space where the teacher has good knowledge to transfer to the student. We then propose a new loss function to train the student network, which achieves better accuracy than the standard KD loss function. We conduct extensive experiments to demonstrate that our method works well for KD tasks with or without distribution mismatch. To the best of our knowledge, ours is the first method to address the challenge of distribution mismatch when performing the KD process.
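
For reference, the summary compares against the standard KD loss. The sketch below shows a common PyTorch formulation of that baseline loss, plus a simple teacher-confidence filter as a rough illustration of keeping "images well-classified by the teacher". This is not the chapter's method; the hyper-parameters `temperature`, `alpha`, and `threshold` are conventional names assumed here for illustration.

```python
import torch
import torch.nn.functional as F

def standard_kd_loss(student_logits, teacher_logits, labels,
                     temperature=4.0, alpha=0.5):
    """Standard KD baseline: soft-target KL divergence plus hard-label cross-entropy."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=1)
    # Scale the KL term by T^2, as in standard KD, so its gradient magnitude
    # stays comparable to the cross-entropy term.
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

def teacher_confident_mask(teacher_logits, threshold=0.9):
    """Keep only samples the teacher classifies with high confidence -- a crude
    stand-in for restricting distillation to data the teacher knows well."""
    probs = F.softmax(teacher_logits, dim=1)
    return probs.max(dim=1).values >= threshold
```

In this sketch, distillation under mismatch would apply `standard_kd_loss` only to the rows selected by `teacher_confident_mask`; the chapter's actual approach learns a distribution over the student's data to sample such images rather than thresholding confidence.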
ISBN: 3030865193, 9783030865191
ISSN: 0302-9743, 1611-3349
DOI: 10.1007/978-3-030-86520-7_16