What is Lost in Knowledge Distillation?

Bibliographic Details
Main Authors: Mohanty, Manas; Roosta, Tanya; Passban, Peyman
Format: Journal Article
Language: English
Published: 07.11.2023
Subjects: Computer Science - Computation and Language
Online Access: Get full text

Abstract Deep neural networks (DNNs) have significantly improved NLP tasks, but training and maintaining such networks can be costly. Model compression techniques, such as knowledge distillation (KD), have been proposed to address the issue; however, the compression process can be lossy. Motivated by this, our work investigates how a distilled student model differs from its teacher, whether the distillation process causes any information loss, and whether the loss follows a specific pattern. Our experiments aim to shed light on which types of tasks might be more or less sensitive to KD by reporting data points on the contribution of different factors, such as the number of layers or attention heads. Results such as ours can be used to determine effective and efficient configurations that achieve optimal information transfer between larger (teacher) and smaller (student) models.
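
For context, knowledge distillation trains the smaller student model to match the softened output distribution of the larger teacher in addition to the ground-truth labels. The sketch below shows the standard soft-target KD objective in PyTorch; it is a generic illustration, not code released with this paper, and the temperature and alpha weighting are illustrative defaults.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with a soft-target distillation term."""
    # Soften both output distributions with the same temperature.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence between student and teacher soft distributions,
    # scaled by T^2 so its gradients stay comparable to the hard-label term.
    distill_term = F.kl_div(log_soft_student, soft_targets, reduction="batchmean") * temperature ** 2
    # Standard cross-entropy against the ground-truth labels.
    hard_term = F.cross_entropy(student_logits, labels)
    return alpha * distill_term + (1.0 - alpha) * hard_term

# Toy usage: a batch of 8 examples over a 3-way classification task.
student_logits = torch.randn(8, 3)
teacher_logits = torch.randn(8, 3)
labels = torch.randint(0, 3, (8,))
loss = kd_loss(student_logits, teacher_logits, labels)
```

The temperature controls how much of the teacher's relative preference over incorrect classes is preserved in the soft targets, which is the kind of signal that can be lost when distilling into a smaller student.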
Copyright http://arxiv.org/licenses/nonexclusive-distrib/1.0
DOI 10.48550/arxiv.2311.04142
DatabaseName arXiv Computer Science
arXiv.org
OpenAccessLink https://arxiv.org/abs/2311.04142
PublicationDate 2023-11-07
SecondaryResourceType preprint
SourceID arxiv
SourceType Open Access Repository
SubjectTerms Computer Science - Computation and Language
Title What is Lost in Knowledge Distillation?
URI https://arxiv.org/abs/2311.04142