SPViT: Enabling Faster Vision Transformers via Latency-Aware Soft Token Pruning

Bibliographic Details
Published in: Computer Vision - ECCV 2022, Vol. 13671, pp. 620-640
Main Authors: Kong, Zhenglun; Dong, Peiyan; Ma, Xiaolong; Meng, Xin; Niu, Wei; Sun, Mengshu; Shen, Xuan; Yuan, Geng; Ren, Bin; Tang, Hao; Qin, Minghai; Wang, Yanzhi
Format: Book Chapter
Language: English
Published: Springer Nature Switzerland, 2022
Series: Lecture Notes in Computer Science

Summary: Vision Transformers (ViTs) have recently set new milestones across computer vision, but their high computation and memory costs hinder deployment in industrial production. Considering the computational complexity, the internal data patterns of ViTs, and edge-device deployment, we propose SPViT, a latency-aware soft token pruning framework that can be set up on vanilla Transformers of both flat and hierarchical structures, such as DeiTs and Swin-Transformers (Swin). Concretely, we design a dynamic, attention-based multi-head token selector, a lightweight module for adaptive instance-wise token selection. We further introduce a soft pruning technique that integrates the less informative tokens chosen by the selector into a single package token rather than discarding them completely. Through our proposed latency-aware training strategy, SPViT is bound to the accuracy and latency requirements of specific edge devices. Experimental results show that SPViT significantly reduces the computation cost of ViTs while delivering comparable performance on image classification. Moreover, SPViT can guarantee that the identified model meets the latency specifications of mobile devices and FPGAs, and even achieves real-time execution of DeiT-T on mobile devices. For example, SPViT reduces the latency of DeiT-T to 26 ms on a mobile device (26%-41% better than existing works) with 0.25%-4% higher top-1 accuracy on ImageNet. Our code is released at https://github.com/PeiyanFlying/SPViT.
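The soft pruning idea described above can be illustrated with a short sketch. Below is a minimal, illustrative PyTorch approximation of score-based token selection with a package token; the module name SoftTokenPruning, the small scoring MLP, and the score-weighted aggregation are assumptions made for clarity, not the authors' implementation (which is available at the GitHub link in the summary).

    import torch
    import torch.nn as nn


    class SoftTokenPruning(nn.Module):
        # Keeps the highest-scoring tokens and folds the rest into a single
        # "package token" (a score-weighted average) instead of discarding them.
        def __init__(self, embed_dim, keep_ratio=0.7):
            super().__init__()
            self.keep_ratio = keep_ratio
            # Lightweight selector predicting one keep-score per token (assumed design).
            self.selector = nn.Sequential(
                nn.Linear(embed_dim, embed_dim // 4),
                nn.GELU(),
                nn.Linear(embed_dim // 4, 1),
            )

        def forward(self, x):
            # x: (batch, 1 + num_tokens, dim); token 0 is the class token.
            cls_tok, tokens = x[:, :1], x[:, 1:]
            scores = self.selector(tokens).squeeze(-1).softmax(dim=-1)  # (B, N)

            num_keep = max(1, int(self.keep_ratio * tokens.shape[1]))
            keep_idx = scores.topk(num_keep, dim=-1).indices            # (B, K)
            gather_idx = keep_idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])
            kept = tokens.gather(1, gather_idx)                         # informative tokens

            # Soft pruning: aggregate the non-kept tokens into one package token.
            mask = torch.ones_like(scores).scatter(1, keep_idx, 0.0)   # 1 where pruned
            w = (scores * mask).unsqueeze(-1)
            package = (tokens * w).sum(1, keepdim=True) / (w.sum(1, keepdim=True) + 1e-6)

            return torch.cat([cls_tok, kept, package], dim=1)          # (B, 1 + K + 1, D)


    # Example: prune a DeiT-T-like sequence of 196 patch tokens plus a class token.
    x = torch.randn(2, 197, 192)
    print(SoftTokenPruning(192)(x).shape)  # torch.Size([2, 139, 192])

The latency benefit comes from the layers after the pruning block: subsequent self-attention runs over the shortened token sequence, while the package token preserves a summary of the pruned tokens instead of losing their information entirely.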
Bibliography: Z. Kong and P. Dong contributed equally.
Supplementary Information: The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-20083-0_37.
ISBN: 3031200829; 9783031200823
ISSN: 0302-9743; 1611-3349
DOI: 10.1007/978-3-031-20083-0_37