A 50.4 GOPs/W FPGA-Based MobileNetV2 Accelerator using the Double-Layer MAC and DSP Efficiency Enhancement


Bibliographic Details
Published in: 2021 IEEE Asian Solid-State Circuits Conference (A-SSCC), pp. 1-3
Main Authors: Li, Jixuan; Chen, Jiabao; Un, Ka-Fai; Yu, Wei-Han; Mak, Pui-In; Martins, Rui P.
Format: Conference Proceeding
Language: English
Published: IEEE, 07.11.2021

Summary: Convolutional neural network (CNN) models such as MobileNetV2 [1] and Xception are based on depthwise separable convolution. Compared to VGG16 for ImageNet inference, they reduce the number of parameters by over 40× and the number of operations by over 64×, while maintaining a TOP-1 accuracy of 72%. With 8-bit quantization, the memory required to store the model can be compressed by a further 4×. This multifold compression of model size allows real-time, complex machine learning tasks to be implemented on a low-power FPGA suited to Internet-of-Things edge computation. A previous effort [2] improved computational energy efficiency by exploiting model sparsity, but its effectiveness drops in modern CNN models that are already compressed. As a result, new techniques that further advance the energy efficiency of CNN accelerators are desirable. [3] is a scalable adder tree for energy-efficient depthwise separable convolution computation, and [4] is a frame-rate enhancement technique; both fail to handle the extensive memory access during separable convolution, which dominates the power consumption of CNN accelerators. Herein we propose a double-layer multiply-accumulate (MAC) scheme that evaluates two layers within the bottleneck layer in a pipelined manner. It results in a significant reduction of memory access to the feature maps. On top of that, we introduce a double-operation digital signal processor (DSP) that enhances the throughput of the accelerator by using one high-precision DSP to compute two fixed-point operations in one clock cycle.
DOI: 10.1109/A-SSCC53895.2021.9634838
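
The summary does not say how the double-operation DSP is constructed. As an illustration only, the sketch below shows one generic way that a single wide multiplier can yield two fixed-point products in one operation: two signed 8-bit activations are packed into one wide operand and multiplied by a shared signed 8-bit weight. The shared weight, the 18-bit offset `SHIFT`, and the function name `packed_dual_multiply` are assumptions made for this example and are not taken from the paper.

```python
SHIFT = 18  # assumed offset; wide enough that the low product cannot overlap the high one


def packed_dual_multiply(a: int, b: int, w: int) -> tuple[int, int]:
    """Return (a*w, b*w) using a single wide multiplication.

    a, b, w are signed 8-bit integers. This models the arithmetic only,
    not any specific DSP primitive.
    """
    for v in (a, b, w):
        if not -128 <= v <= 127:
            raise ValueError("operands must fit in signed 8 bits")

    # Pack both activations into one wide operand: a sits above bit SHIFT.
    packed = a * (1 << SHIFT) + b

    # One wide multiply produces a*w * 2**SHIFT + b*w.
    product = packed * w

    # Recover b*w from the low SHIFT bits, sign-extending the field.
    low = product & ((1 << SHIFT) - 1)
    if low >= 1 << (SHIFT - 1):
        low -= 1 << SHIFT

    # Removing the low field leaves an exact multiple of 2**SHIFT.
    high = (product - low) >> SHIFT
    return high, low


if __name__ == "__main__":
    print(packed_dual_multiply(3, -5, 7))
```

The subtraction of `low` before shifting is the correction step that signed operands require: a negative low product borrows from the high field, so the high field cannot simply be read out by shifting.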