Autocorrelation Matrix Knowledge Distillation: A Task-Specific Distillation Method for BERT Models

Pre-trained language models perform well in various natural language processing tasks. However, their large number of parameters poses significant challenges for edge devices with limited resources, greatly limiting their application in practical deployment. This paper introduces a simple and effici...

Full description

Saved in:
Bibliographic Details
Published inApplied sciences Vol. 14; no. 20; p. 9180
Main Authors Zhang, Kai, Li, Jinqiu, Wang, Bingqian, Meng, Haoran
Format Journal Article
LanguageEnglish
Published Basel MDPI AG 10.10.2024
Subjects
Online AccessGet full text

Cover

Loading…
More Information
Summary:Pre-trained language models perform well in various natural language processing tasks. However, their large number of parameters poses significant challenges for edge devices with limited resources, greatly limiting their application in practical deployment. This paper introduces a simple and efficient method called Autocorrelation Matrix Knowledge Distillation (AMKD), aimed at improving the performance of smaller BERT models for specific tasks and making them more applicable in practical deployment scenarios. The AMKD method effectively captures the relationships between features using the autocorrelation matrix, enabling the student model to learn not only the performance of individual features from the teacher model but also the correlations among these features. Additionally, it addresses the issue of dimensional mismatch between the hidden states of the student and teacher models. Even in cases where the dimensions are smaller, AMKD retains the essential features from the teacher model, thereby minimizing information loss. Experimental results demonstrate that BERTTINY-AMKD outperforms traditional distillation methods and baseline models, achieving an average score of 83.6% on GLUE tasks. This represents a 4.1% improvement over BERTTINY-KD and exceeds the performance of BERT4-PKD and DistilBERT4 by 2.6% and 3.9%, respectively. Moreover, despite having only 13.3% of the parameters of BERTBASE, the BERTTINY-AMKD model retains over 96.3% of the performance of the teacher model, BERTBASE.
ISSN:2076-3417
2076-3417
DOI:10.3390/app14209180