Rep ViT: Revisiting Mobile CNN From ViT Perspective

Recently, lightweight Vision Transformers (ViTs) demon-strate superior performance and lower latency, compared with lightweight Convolutional Neural Networks (CNNs), on resource-constrained mobile devices. Researchers have discovered many structural connections be-tween lightweight ViTs and lightwei...

Full description

Saved in:

Bibliographic Details
Published in	Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online) pp. 15909 - 15920
Main Authors	Wang, Ao, Chen, Hui, Lin, Zijia, Han, Jungong, Ding, Guiguang
Format	Conference Proceeding
Language	English
Published	IEEE 16.06.2024
Subjects	Accuracy CNN Codes Computational modeling Computer vision Mobile handsets Performance evaluation Transformers ViT
Online Access	Get full text

Cover

Loading…

More Information
Summary:	Recently, lightweight Vision Transformers (ViTs) demon-strate superior performance and lower latency, compared with lightweight Convolutional Neural Networks (CNNs), on resource-constrained mobile devices. Researchers have discovered many structural connections be-tween lightweight ViTs and lightweight CNNs. However, the notable architectural disparities in the block structure, macro, and micro designs between them have not been adequately examined. In this study, we revisit the efficient design of lightweight CNNs from ViT perspective and emphasize their promising prospect for mobile devices. Specifically, we incrementally enhance the mobile-friendliness of a standard lightweight CNN, i.e., MobileNetV3, by integrating the efficient architectural designs of lightweight ViTs. This ends up with a new family of pure lightweight CNNs, namely RepViT. Extensive experiments show that RepViT outperforms existing state-of-the-art lightweight ViTs and exhibits favorable latency in various vision tasks. Notably, on ImageNet, RepViT achieves over 80% top-1 accuracy with 1.0 ms latency on an iPhone 12, which is the first time for a lightweight model, to the best of our knowledge. Besides, when RepViT meets SAM, our RepViT-SAM can achieve nearly 10x faster inference than the advanced MobileSAM. Codes and models are available at https://github.com/THU-MIG/RepViT.
ISSN:	1063-6919
DOI:	10.1109/CVPR52733.2024.01506