TT-ViT: Vision Transformer Compression Using Tensor-Train Decomposition

Bibliographic Details
Published in: Computational Collective Intelligence (Lecture Notes in Computer Science, Vol. 13501), pp. 755-767
Main Authors: Pham Minh, Hoang; Nguyen Xuan, Nguyen; Tran Thai, Son
Format: Book Chapter
Language: English
Published: Switzerland: Springer International Publishing AG, 2022
Series: Lecture Notes in Computer Science
Summary: Inspired by the Transformer, one of the most successful deep learning models in natural language processing, machine translation, and related fields, the Vision Transformer (ViT) has recently demonstrated its effectiveness in computer vision tasks such as image classification and object detection. However, the major issue with ViT is that it requires a massive number of trainable parameters. In this paper, we propose a novel compressed ViT model, namely Tensor-train ViT (TT-ViT), based on tensor-train (TT) decomposition. Considering a multi-head self-attention layer, instead of storing the whole trainable matrices, we represent them in TT format via their TT cores, using fewer parameters. The results of our experiments on the CIFAR-10 and Fashion-MNIST datasets reveal that TT-ViT achieves outstanding performance, with accuracy equivalent to that of its baseline model while using only half as many parameters in total.
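
For intuition, the sketch below shows one way a TT-format linear layer of the kind the summary describes can be implemented in PyTorch. This is not the authors' code: the TTLinear class, the mode shapes, TT ranks, and initialization are illustrative assumptions, and the dense weight is reconstructed explicitly only to keep the contraction easy to follow (an efficient implementation would contract the TT cores with the input directly).

```python
# Minimal sketch (assumed, not the authors' released code): a tensor-train
# (TT) linear layer. A weight of shape (M, N), with M = prod(in_modes) and
# N = prod(out_modes), is stored as small 4-way cores G_k of shape
# (r_{k-1}, m_k, n_k, r_k), where r_0 = r_d = 1.
import torch
import torch.nn as nn


class TTLinear(nn.Module):
    def __init__(self, in_modes, out_modes, ranks):
        super().__init__()
        assert len(in_modes) == len(out_modes) == len(ranks) - 1
        self.in_modes, self.out_modes = in_modes, out_modes
        # Parameter count is sum_k r_{k-1} * m_k * n_k * r_k, far below
        # prod(m) * prod(n) for a dense matrix of the same shape.
        self.cores = nn.ParameterList([
            nn.Parameter(0.1 * torch.randn(ranks[k], in_modes[k],
                                           out_modes[k], ranks[k + 1]))
            for k in range(len(in_modes))])

    def full_weight(self):
        # Contract the TT cores back into the dense (M, N) matrix.
        w = self.cores[0]                       # shape (1, m1, n1, r1)
        for core in self.cores[1:]:
            w = torch.einsum('...r,rmns->...mns', w, core)
        w = w.squeeze(0).squeeze(-1)            # (m1, n1, ..., md, nd)
        d = len(self.in_modes)
        # Group all input modes before all output modes, then flatten.
        w = w.permute(*range(0, 2 * d, 2), *range(1, 2 * d, 2))
        M = 1
        for m in self.in_modes:
            M *= m
        return w.reshape(M, -1)                 # (M, N)

    def forward(self, x):                       # x: (batch, M)
        return x @ self.full_weight()


# Example: a 256x256 projection factorized as (4, 8, 8) x (4, 8, 8) with
# TT ranks (1, 8, 8, 1) needs 4,736 parameters instead of 65,536.
layer = TTLinear((4, 8, 8), (4, 8, 8), (1, 8, 8, 1))
out = layer(torch.randn(2, 256))                # -> shape (2, 256)
print(sum(p.numel() for p in layer.parameters()), 256 * 256)
```

The rank choices above are arbitrary; in general, smaller TT ranks give stronger compression at some cost in expressiveness, which is the trade-off a compressed model like TT-ViT navigates.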
Bibliography: This research is funded by University of Science, VNU-HCM under grant number CNTT 2020-09.
ISBN: 9783031160134; 3031160134
ISSN: 0302-9743; 1611-3349
DOI: 10.1007/978-3-031-16014-1_59