EdgeViTs: Competing Light-Weight CNNs on Mobile Devices with Vision Transformers
Published in | Computer Vision - ECCV 2022, Vol. 13671, pp. 294-311 |
Main Authors | |
Format | Book Chapter |
Language | English |
Published | Switzerland: Springer Nature Switzerland, 2022 |
Series | Lecture Notes in Computer Science |
Summary: | Self-attention based models such as vision transformers (ViTs) have emerged as a very competitive architecture alternative to convolutional neural networks (CNNs) in computer vision. Despite increasingly stronger variants with ever higher recognition accuracies, existing ViTs are typically demanding in computation and model size due to the quadratic complexity of self-attention. Although several successful design choices (e.g., convolutions and the hierarchical multi-stage structure) of prior CNNs have been reintroduced into recent ViTs, they are still not sufficient to meet the limited resource requirements of mobile devices. This motivates a very recent attempt to develop light ViTs based on the state-of-the-art MobileNet-v2, but a performance gap still remains. In this work, pushing further along this under-studied direction, we introduce EdgeViTs, a new family of light-weight ViTs that, for the first time, enable attention-based vision models to compete with the best light-weight CNNs in the tradeoff between accuracy and on-device efficiency. This is realized by introducing a highly cost-effective local-global-local (LGL) information exchange bottleneck based on an optimal integration of self-attention and convolutions. For device-dedicated evaluation, rather than relying on inaccurate proxies such as the number of FLOPs or parameters, we adopt a practical approach of focusing directly on on-device latency and, for the first time, energy efficiency. Extensive experiments on image classification, object detection and semantic segmentation validate the high efficiency of our EdgeViTs when compared to state-of-the-art efficient CNNs and ViTs in terms of the accuracy-efficiency tradeoff on mobile hardware. Specifically, we show that our models are Pareto-optimal when both accuracy-latency and accuracy-energy tradeoffs are considered, achieving strict dominance over other ViTs in almost all cases and competing with the most efficient CNNs. Code is available at https://github.com/saic-fi/edgevit. |
Bibliography: | Supplementary Information: The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-20083-0_18. J. Pan: Work done during an internship at Samsung AI Cambridge. |
ISBN: | 3031200829 9783031200823 |
ISSN: | 0302-9743 1611-3349 |
DOI: | 10.1007/978-3-031-20083-0_18 |
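
The abstract above describes a local-global-local (LGL) information exchange bottleneck that combines convolutions with sparse self-attention. The sketch below is a minimal, illustrative PyTorch rendition of that idea, not the authors' implementation (see https://github.com/saic-fi/edgevit for the official code): the layer choices (a depth-wise convolution for local aggregation, strided subsampling plus multi-head self-attention for the global step, and a transposed convolution for local propagation), the module name `LGLBlock`, and all hyper-parameters are assumptions made for illustration.

```python
# Minimal sketch of an LGL-style block, assuming H and W are divisible by
# sample_rate. This is an illustrative approximation, not the EdgeViT code.
import torch
import torch.nn as nn


class LGLBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4, sample_rate: int = 4):
        super().__init__()
        self.r = sample_rate
        # Local aggregation: depth-wise conv mixes information within each
        # local neighbourhood before tokens are subsampled.
        self.local_agg = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        # Global sparse attention: self-attention over every r-th token only,
        # keeping the cost far below full attention over all H*W tokens.
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Local propagation: transposed conv broadcasts the globally updated
        # tokens back to their original neighbourhoods.
        self.local_prop = nn.ConvTranspose2d(
            dim, dim, kernel_size=sample_rate, stride=sample_rate, groups=dim
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map.
        B, C, H, W = x.shape
        x = x + self.local_agg(x)                     # local aggregation
        sub = x[:, :, ::self.r, ::self.r]             # sparse token sampling
        h, w = sub.shape[2], sub.shape[3]
        tokens = self.norm(sub.flatten(2).transpose(1, 2))   # (B, h*w, C)
        attn_out, _ = self.attn(tokens, tokens, tokens)      # global attention
        sub = attn_out.transpose(1, 2).reshape(B, C, h, w)
        x = x + self.local_prop(sub)                  # local propagation
        return x


if __name__ == "__main__":
    block = LGLBlock(dim=64)
    out = block(torch.randn(1, 64, 32, 32))
    print(out.shape)  # torch.Size([1, 64, 32, 32])
```

Under these assumptions, attending only to every r-th token shrinks the attention cost from O((HW)^2) to roughly O((HW/r^2)^2), which illustrates how an LGL-style bottleneck can trade a small amount of global interaction for a large reduction in on-device compute.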