PET: Parameter-efficient Knowledge Distillation on Transformer

Bibliographic Details
Published in	PLoS ONE, Vol. 18, no. 7, p. e0288060
Main Authors	Jeon, Hyojin; Park, Seungcheol; Kim, Jin-Gee; Kang, U
Format	Journal Article
Language	English
Published	Public Library of Science (PLoS), United States, 06.07.2023
Subjects	Accuracy; Analysis; Artificial intelligence; Biology and Life Sciences; Coders; Compression; Computational efficiency; Computational linguistics; Computer and Information Sciences; Distillation; Energy consumption; Inference; Knowledge; Language; Language processing; Machine translation; Mathematical models; Mental task performance; Methods; Modelling; Natural language interfaces; Parameter identification; People and Places; Research and Analysis Methods; Social Sciences; Transformers; South Korea; United Kingdom

More Information
Summary:	Given a large Transformer model, how can we obtain a small and computationally efficient model that maintains the performance of the original model? Transformers have shown significant performance improvements on many NLP tasks in recent years. However, their large size, expensive computational cost, and long inference time make it challenging to deploy them on resource-constrained devices. Existing Transformer compression methods mainly focus on reducing the size of the encoder, ignoring the fact that the decoder accounts for the major portion of the long inference time. In this paper, we propose PET (Parameter-Efficient knowledge distillation on Transformer), an efficient Transformer compression method that reduces the size of both the encoder and the decoder. In PET, we identify and exploit pairs of parameter groups for efficient weight sharing, and employ a warm-up process using a simplified task to increase the gain from knowledge distillation. Extensive experiments on five real-world datasets show that PET outperforms existing methods on machine translation tasks. Specifically, on the IWSLT'14 EN→DE task, PET reduces memory usage by 81.20% and accelerates inference speed by 45.15% compared to the uncompressed model, with a minor decrease in BLEU score of 0.27.
Bibliography:	Competing Interests: The authors have declared that no competing interests exist.
ISSN:	1932-6203
DOI:	10.1371/journal.pone.0288060
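
Illustrative note: the Summary mentions sharing weights between pairs of parameter groups to shrink the model. The toy sketch below shows only the general idea of weight tying and how it lowers the count of unique parameters. It is not the authors' implementation: this record does not say which parameter groups PET pairs, so the layer layout, the group names (`attention`, `feed_forward`), the dimensions, and the pairing in `pairs` are all hypothetical assumptions, and flat Python lists stand in for weight matrices.

```python
def make_layer(d_model, d_ff):
    """Build a toy layer; flat lists stand in for weight matrices (assumption)."""
    return {
        "attention": [0.0] * (4 * d_model * d_model),   # Q, K, V, output projections
        "feed_forward": [0.0] * (2 * d_model * d_ff),   # two linear maps
    }


def share_pairs(layers, pairs):
    """Tie weights: for each (source, target) pair, the target group
    reuses the very same weight object as the source group."""
    shared = [dict(layer) for layer in layers]  # copy dicts, keep weight objects
    for (src_layer, src_name), (dst_layer, dst_name) in pairs:
        shared[dst_layer][dst_name] = shared[src_layer][src_name]
    return shared


def count_unique_parameters(layers):
    """Count parameters, counting each shared weight object only once."""
    seen = set()
    total = 0
    for layer in layers:
        for weights in layer.values():
            if id(weights) not in seen:
                seen.add(id(weights))
                total += len(weights)
    return total


# Hypothetical 4-layer stack with small dimensions.
layers = [make_layer(d_model=8, d_ff=16) for _ in range(4)]

# Hypothetical pairing: layer 1 reuses layer 0, layer 3 reuses layer 2.
pairs = [
    ((0, "attention"), (1, "attention")),
    ((0, "feed_forward"), (1, "feed_forward")),
    ((2, "attention"), (3, "attention")),
    ((2, "feed_forward"), (3, "feed_forward")),
]

baseline = count_unique_parameters(layers)
compressed = count_unique_parameters(share_pairs(layers, pairs))
reduction = 100.0 * (baseline - compressed) / baseline
print(f"unique parameters: {baseline} -> {compressed} ({reduction:.1f}% fewer)")
```

In this toy setup, tying each pair halves the number of unique parameters. The figures reported in the Summary (81.20% memory reduction) come from the paper's own experiments and are not reproduced by this sketch.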