TT-ViT: Vision Transformer Compression Using Tensor-Train Decomposition
Published in | Computational Collective Intelligence, Vol. 13501, pp. 755–767 |
---|---|
Main Authors | |
Format | Book Chapter |
Language | English |
Published | Switzerland: Springer International Publishing, 2022 |
Series | Lecture Notes in Computer Science |
Summary: | Inspired by the Transformer, one of the most successful deep learning models in natural language processing and machine translation, the Vision Transformer (ViT) has recently demonstrated its effectiveness in computer vision tasks such as image classification and object detection. However, the major issue with ViT is that it requires a massive number of trainable parameters. In this paper, we propose a novel compressed ViT model, namely Tensor-train ViT (TT-ViT), based on tensor-train (TT) decomposition. For a multi-head self-attention layer, instead of storing the full trainable matrices, we represent them in TT format via their TT cores, using far fewer parameters. The results of our experiments on the CIFAR-10 and Fashion-MNIST datasets reveal that TT-ViT achieves accuracy equivalent to that of its baseline model while using only half as many parameters. |
Bibliography: | This research is funded by University of Science, VNU-HCM under grant number CNTT 2020-09. |
ISBN: | 9783031160134; 3031160134 |
ISSN: | 0302-9743; 1611-3349 |
DOI: | 10.1007/978-3-031-16014-1_59 |