WebInf: Accelerating WebGPU-based In-browser DNN Inference via Adaptive Model Partitioning

Bibliographic Details
Published in: 2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS), pp. 2499–2506
Main Authors: Dong, Bing; Liu, Tianen; Li, Borui; Zhou, Xiaolei; Wang, Shuai; Xu, Zhao-Dong
Format: Conference Proceeding
Language: English
Published: IEEE, 17.12.2023
Summary: Artificial intelligence (AI) model inference in browsers is constrained by limited performance, while offloading inference to the cloud incurs substantial data-transfer time. In this paper, we investigate the status quo of cloud and in-browser processing and explore methods for partitioning model computation. Our study is rooted in WebGPU and employs the TensorFlow.js framework, covering seven AI models spanning the computer vision, natural language processing, and automatic speech recognition domains. Leveraging the characteristics of neural network layers, we find that partitioning AI models at layer granularity yields a significant performance boost. Building on this, we design WebInf, a system that adaptively partitions AI models at layer granularity between the browser and the server to accelerate inference. WebInf supports diverse hardware, wireless networks, neural network structures, and servers, adapting its partitioning for optimal inference performance. We evaluate WebInf on two laptops and servers, demonstrating inference time improvements of 30% and 52%, respectively, compared with executing inference entirely on the server or entirely in the browser; the improvements peak at 33% and 69%, respectively.
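The layer-granularity partitioning the summary describes can be sketched with TensorFlow.js: run the first layers of a model in the browser on the WebGPU backend, then send the intermediate activation to the server, which executes the remaining layers. The sketch below is a minimal illustration under assumed names, not WebInf's actual interface: the split index, the /infer-tail endpoint, and the JSON payload format are hypothetical, whereas the paper's adaptive partitioner would select the split point from profiled layer and network costs.

```typescript
// Minimal sketch of layer-granularity browser/server partitioning with
// TensorFlow.js. The split index, the '/infer-tail' endpoint, and the JSON
// payload are hypothetical illustrations, not taken from the paper.
import * as tf from '@tensorflow/tfjs';
import '@tensorflow/tfjs-backend-webgpu';

async function partitionedInference(
  model: tf.LayersModel,
  input: tf.Tensor,
  splitIdx: number,              // chosen by the (hypothetical) partitioner
): Promise<number[]> {
  await tf.setBackend('webgpu'); // run the browser-side "head" on WebGPU

  // Head sub-model: from the original input up to the split layer, run locally.
  const head = tf.model({
    inputs: model.inputs,
    outputs: model.layers[splitIdx].output as tf.SymbolicTensor,
  });
  const intermediate = head.predict(input) as tf.Tensor;

  // Ship the intermediate activation to the server, which executes the
  // remaining ("tail") layers and returns the final output.
  const payload = {
    splitIdx,
    shape: intermediate.shape,
    data: Array.from(await intermediate.data()),
  };
  intermediate.dispose();

  const resp = await fetch('/infer-tail', { // hypothetical server endpoint
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(payload),
  });
  return resp.json();
}
```

In such a scheme, the "adaptive" aspect the summary mentions corresponds to re-selecting splitIdx as device capability and network conditions change, trading browser compute time against the size of the intermediate tensor that must cross the network.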
ISSN: 2690-5965
DOI: 10.1109/ICPADS60453.2023.00333