Attention and feature transfer based knowledge distillation

Bibliographic Details
Published in	Scientific reports Vol. 13; no. 1; pp. 18369 - 10
Main Authors	Yang, Guoliang, Yu, Shuaiying, Sheng, Yangyang, Yang, Hao
Format	Journal Article
Language	English
Published	London: Nature Publishing Group UK, 26.10.2023; Nature Publishing Group; Nature Portfolio
Subjects	639/705/117; 639/705/258; Distillation; Humanities and Social Sciences; Information processing; Knowledge; multidisciplinary; Neural networks; Science; Science (multidisciplinary); Students
Summary:	Existing knowledge distillation (KD) methods are mainly based on features, logits, or attention. Features and logits represent the results of reasoning at different stages of a convolutional neural network, while attention maps represent the reasoning process. Because the two are continuous in time, transferring only one of them to the student network leads to unsatisfactory results. We study knowledge transfer between teacher and student networks to different degrees and show the importance of simultaneously transferring knowledge about both the reasoning process and the reasoning results to the student network, which provides a new perspective for the study of KD. On this basis, we propose a knowledge distillation method based on attention and feature transfer (AFT-KD). First, we use transformation structures to transform intermediate features into attention and feature blocks (AFBs) that contain both inference-process and inference-outcome information, and force the student to learn the knowledge in the AFBs. To save computation during learning, we use block operations to align the teacher and student networks. In addition, to balance the attenuation ratio between different losses, we design an adaptive loss function based on the loss optimization rate. Experiments show that AFT-KD achieves state-of-the-art performance on multiple benchmarks.
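Illustration:	The summary does not give the AFT-KD formulas, so the following is only a minimal sketch of the two ideas it names, under stated assumptions. The attention map is assumed to be the channel-wise sum of squared activations, L2-normalised (a common attention-transfer choice, not confirmed as the paper's). The adaptive weighting is assumed to follow a dynamic-weight-averaging style rule, where a loss that shrinks more slowly receives a larger weight. The function names, the temperature value, and the toy numbers are all hypothetical, and teacher and student features are assumed to be already aligned to the same shape.

```python
import math

def attention_map(feature):
    """Collapse a C x N feature map (C channels, N flattened spatial
    positions) into one L2-normalised attention vector of length N.
    Assumption: attention = sum over channels of squared activations."""
    n = len(feature[0])
    energy = [sum(ch[i] ** 2 for ch in feature) for i in range(n)]
    norm = math.sqrt(sum(e * e for e in energy)) or 1.0
    return [e / norm for e in energy]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def adaptive_weights(prev_losses, curr_losses, temperature=2.0):
    """Assumed rule (dynamic-weight-averaging style): the optimization
    rate of each loss is curr / prev; slower-shrinking losses get larger
    weights. Weights are rescaled to sum to the number of losses."""
    rates = [c / p if p > 0 else 1.0 for p, c in zip(prev_losses, curr_losses)]
    exps = [math.exp(r / temperature) for r in rates]
    total = sum(exps)
    return [len(rates) * e / total for e in exps]

# Toy teacher/student features: 2 channels, 3 spatial positions each.
teacher = [[1.0, 2.0, 0.0], [0.0, 1.0, 2.0]]
student = [[0.5, 1.5, 0.0], [0.0, 1.0, 1.0]]

# Reasoning-process term (attention) and reasoning-result term (features).
attention_loss = mse(attention_map(teacher), attention_map(student))
feature_loss = mse([v for ch in teacher for v in ch],
                   [v for ch in student for v in ch])

# Hypothetical loss values from the previous and current training steps.
weights = adaptive_weights(prev_losses=[0.40, 0.90], curr_losses=[0.20, 0.81])
total_loss = weights[0] * attention_loss + weights[1] * feature_loss
```

In this toy run the second loss fell by only 10% while the first halved, so the second loss receives the larger weight; a real implementation would compute both terms on network tensors inside the training loop.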
ISSN:	2045-2322
DOI:	10.1038/s41598-023-43986-y