Approximating vision transformers for edge: variational inference and mixed-precision for multi-modal data

Bibliographic Details
Published in	Computing Vol. 107; no. 3
Main Authors	Katare, Dewant; Leroux, Sam; Janssen, Marijn; Ding, Aaron Yi
Format	Journal Article
Language	English
Published	Vienna: Springer Vienna, 01.03.2025
Subjects	Artificial Intelligence; Computer Appl. in Administrative Data Processing; Computer Communication Networks; Computer Science; Information Systems Applications (incl. Internet); Regular Paper; Software Engineering; Variational parameters; Multimodality; Model approximation; Edge AI; Quantization; Mixed precision; Vision transformers
ISSN	0010-485X 1436-5057
DOI	10.1007/s00607-025-01427-w

Summary:	Vision transformer (ViT) models have shown higher accuracy, robustness, and the ability to process large volumes of data, setting new baselines and references for perception tasks. However, these advantages require large memory and high-performance processors and computing units, which makes model adaptability and deployment challenging in resource-constrained environments such as memory-restricted and battery-powered edge devices. This paper addresses these deployment challenges by proposing VI-ViT, a model approximation approach for edge deployment that uses variational inference with mixed precision to process multiple modalities, such as point clouds and images. Our experimental evaluation on the nuScenes and Waymo datasets shows reductions of up to 37% and 31% in model parameters and FLOPs, respectively, while maintaining a mean average precision of 70.5 compared to 74.8 for the baseline model. This work presents a practical deployment approach for approximating and optimizing vision transformers for edge AI applications by balancing model metrics such as parameters, FLOPs, latency, energy consumption, and accuracy, and it can easily be adapted to other transformer models and datasets.