Performance Evaluation of Deep Learning Compilers for Edge Inference

Bibliographic Details
Published in: 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 858 - 865
Main Authors: Verma, Gaurav; Gupta, Yashi; Malik, Abid M.; Chapman, Barbara
Format: Conference Proceeding
Language: English
Published: IEEE, 01.06.2021

Summary: Recently, edge computing has received considerable attention as a promising means to provide Deep Learning (DL) based services. However, due to the limited computation capability of the data processing units (such as CPUs, GPUs, and specialized accelerators) in edge devices, using the devices' limited resources efficiently is a challenge that affects deep learning-based analysis services. This has led to the development of several inference compilers, such as TensorRT, TensorFlow Lite, Relay, and TVM, which optimize DL inference models specifically for edge devices. These compilers take standard DL models available for inference in various frameworks (e.g., PyTorch, TensorFlow, Caffe, PaddlePaddle) and transform them into corresponding lightweight models. TensorFlow Lite and TensorRT are considered state-of-the-art inference compilers and encompass most of the compiler optimization techniques that have been proposed for edge computing. This paper presents a detailed performance study of TensorFlow Lite (TFLite) and TensorFlow TensorRT (TF-TRT) using DL models commonly deployed on edge devices, across varying hardware platforms. The work compares throughput, latency, and power consumption. We find that the integrated TF-TRT consistently performs better at high-precision floating point across different DL architectures, especially on GPUs with tensor cores. However, it loses its edge to TFLite in model compression at low precision. TFLite, which is primarily designed for mobile applications, performs better with lightweight DL models than with deep neural network-based models. To the best of our knowledge, this is the first detailed performance comparison of the TF-TRT and TFLite inference compilers.
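
To make the two compilation paths compared in the paper concrete, the sketch below shows how a TensorFlow SavedModel might be converted with TF-TRT at FP16 precision and with TFLite using int8 post-training quantization. This is an illustrative example, not the authors' measurement setup: the model directory, input shape, precision choices, and calibration data are assumptions, and the TF-TRT `conversion_params` argument reflects the TensorFlow 2.x API of the paper's era.

```python
# Illustrative sketch of the two conversion paths compared in the paper.
# The SavedModel directory, precision modes, and representative dataset are
# hypothetical placeholders, not the authors' exact configuration.
import numpy as np
import tensorflow as tf
from tensorflow.python.compiler.tensorrt import trt_convert as trt

SAVED_MODEL_DIR = "resnet50_saved_model"  # assumed path to a trained SavedModel

# --- Path 1: TF-TRT, high-precision (FP16) inference for tensor-core GPUs ---
trt_params = trt.TrtConversionParams(precision_mode=trt.TrtPrecisionMode.FP16)
trt_converter = trt.TrtGraphConverterV2(
    input_saved_model_dir=SAVED_MODEL_DIR,
    conversion_params=trt_params,
)
trt_converter.convert()                    # replaces supported subgraphs with TensorRT engines
trt_converter.save("resnet50_tftrt_fp16")  # optimized SavedModel for GPU inference

# --- Path 2: TFLite, low-precision (int8) post-training quantization ---
def representative_data_gen():
    # Small synthetic calibration set; a real study would use actual input data.
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

tflite_converter = tf.lite.TFLiteConverter.from_saved_model(SAVED_MODEL_DIR)
tflite_converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_converter.representative_dataset = representative_data_gen
tflite_model = tflite_converter.convert()  # compact flatbuffer for edge/mobile targets

with open("resnet50_int8.tflite", "wb") as f:
    f.write(tflite_model)
```

In a study of this kind, throughput and latency would then be measured by running repeated inference against both converted artifacts, with power consumption sampled externally; the paper's own measurement harness is not reproduced here.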
DOI: 10.1109/IPDPSW52791.2021.00128