TT-ViT: Vision Transformer Compression Using Tensor-Train Decomposition

Bibliographic Details
Published in: Computational Collective Intelligence (Lecture Notes in Computer Science, Vol. 13501), pp. 755-767
Main Authors: Pham Minh, Hoang; Nguyen Xuan, Nguyen; Tran Thai, Son
Format: Book Chapter
Language: English
Published: Switzerland: Springer International Publishing AG, 2022
Series: Lecture Notes in Computer Science
Summary: Inspired by the Transformer, one of the most successful deep learning models in natural language processing, machine translation, and related fields, the Vision Transformer (ViT) has recently demonstrated its effectiveness in computer vision tasks such as image classification and object detection. However, the major issue with ViT is that it requires a massive number of trainable parameters. In this paper, we propose a novel compressed ViT model, namely Tensor-train ViT (TT-ViT), based on tensor-train (TT) decomposition. Considering a multi-head self-attention layer, instead of storing the whole trainable matrices, we represent them in TT format via their TT cores, using fewer parameters. The results of our experiments on the CIFAR-10 and Fashion-MNIST datasets reveal that TT-ViT achieves outstanding performance, with accuracy equivalent to that of its baseline model while using only half as many parameters in total.
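
For intuition, the sketch below shows one way a TT-format linear layer of the kind the summary describes can be implemented in PyTorch. This is not the authors' code: the TTLinear class, the mode shapes, TT ranks, and initialization are illustrative assumptions, and the dense weight is reconstructed explicitly only to keep the contraction easy to follow (an efficient implementation would contract the TT cores with the input directly).

```python
# Minimal sketch (assumed, not the authors' released code): a tensor-train
# (TT) linear layer. A weight of shape (M, N), with M = prod(in_modes) and
# N = prod(out_modes), is stored as small 4-way cores G_k of shape
# (r_{k-1}, m_k, n_k, r_k), where r_0 = r_d = 1.
import torch
import torch.nn as nn


class TTLinear(nn.Module):
    def __init__(self, in_modes, out_modes, ranks):
        super().__init__()
        assert len(in_modes) == len(out_modes) == len(ranks) - 1
        self.in_modes, self.out_modes = in_modes, out_modes
        # Parameter count is sum_k r_{k-1} * m_k * n_k * r_k, far below
        # prod(m) * prod(n) for a dense matrix of the same shape.
        self.cores = nn.ParameterList([
            nn.Parameter(0.1 * torch.randn(ranks[k], in_modes[k],
                                           out_modes[k], ranks[k + 1]))
            for k in range(len(in_modes))])

    def full_weight(self):
        # Contract the TT cores back into the dense (M, N) matrix.
        w = self.cores[0]                       # shape (1, m1, n1, r1)
        for core in self.cores[1:]:
            w = torch.einsum('...r,rmns->...mns', w, core)
        w = w.squeeze(0).squeeze(-1)            # (m1, n1, ..., md, nd)
        d = len(self.in_modes)
        # Group all input modes before all output modes, then flatten.
        w = w.permute(*range(0, 2 * d, 2), *range(1, 2 * d, 2))
        M = 1
        for m in self.in_modes:
            M *= m
        return w.reshape(M, -1)                 # (M, N)

    def forward(self, x):                       # x: (batch, M)
        return x @ self.full_weight()


# Example: a 256x256 projection factorized as (4, 8, 8) x (4, 8, 8) with
# TT ranks (1, 8, 8, 1) needs 4,736 parameters instead of 65,536.
layer = TTLinear((4, 8, 8), (4, 8, 8), (1, 8, 8, 1))
out = layer(torch.randn(2, 256))                # -> shape (2, 256)
print(sum(p.numel() for p in layer.parameters()), 256 * 256)
```

The rank choices above are arbitrary; in general, smaller TT ranks give stronger compression at some cost in expressiveness, which is the trade-off a compressed model like TT-ViT navigates.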
Bibliography: This research is funded by University of Science, VNU-HCM under grant number CNTT 2020-09.
ISBN: 9783031160134; 3031160134
ISSN: 0302-9743; 1611-3349
DOI: 10.1007/978-3-031-16014-1_59