Design and Implementation of Convolutional Neural Networks Accelerator Based on Multidie

Bibliographic Details
Published in	IEEE Access, Vol. 10, pp. 91497-91508
Main Authors	Song, Qingzeng; Zhang, Jiabing; Sun, Liankun; Jin, Guanghao
Format	Journal Article
Language	English
Published	Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2022
Subjects	Algorithms; Artificial neural networks; Chips (memory devices); Computer architecture; Convolutional neural networks; Digital signal processing; Digital signal processors; Feature maps; Field programmable gate arrays; Frames per second; Hardware acceleration; Hardware accelerator; Mathematical models; Microprocessors; multi-die; Network latency; Object detection; Object recognition; Power consumption; Quantization (signal); YOLOv4-tiny
Summary:	To achieve real-time object detection with high throughput and low latency, this paper proposes a multi-die hardware accelerator architecture. It implements three accelerators on the VU9P chip, each bound to an independent super logic region (SLR). To reduce off-chip memory access and power consumption, the design uses three on-chip buffers to store the weights and intermediate results, minimizes data access and movement, and maximizes data reuse. An 8-bit quantization strategy for both weights and feature maps doubles the throughput and computational efficiency obtained from a single digital signal processor (DSP). The accelerator provides many operators, all fully parameterized, so the network is easy to extend, and the accelerator is controlled by configuring the instruction group. When accelerating the YOLOv4-tiny algorithm, the architecture achieves a frame rate of 148.14 frames per second (FPS) and a peak throughput of 2.76 tera operations per second (TOPS) at 200 MHz, with an energy efficiency of 93.15 GOPS/W. The code can be found at https://github.com/19801201/Verilog_CNN_Accelerator.
ISSN:	2169-3536
DOI:	10.1109/ACCESS.2022.3199441
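
The summary states that 8-bit quantization doubles the throughput of a single DSP but does not say how. One widely used approach on FPGA DSP slices, assumed here purely for illustration and not confirmed by this record, is to pack two 8-bit operands into one wide multiplier input so that a single multiplication yields two products that share the other operand. The sketch below models that arithmetic in Python; the function name, the SHIFT width, and the field layout are hypothetical and are not taken from the paper or its repository.

```python
# Illustrative model only: packing two signed 8-bit multiplications that
# share one operand into a single wide multiplication.
# SHIFT is a hypothetical field width; it must leave room for a signed
# 8x8-bit product (at most 16 bits plus sign) in the low field.
SHIFT = 18


def packed_dual_multiply(a: int, b: int, c: int) -> tuple[int, int]:
    """Return (a*c, b*c) while performing only one multiplication."""
    for value in (a, b, c):
        if not -128 <= value <= 127:
            raise ValueError("operands must fit in signed 8 bits")

    packed = (a << SHIFT) + b          # two operands share one multiplier input
    product = packed * c               # the single wide multiplication

    # Low field holds b*c as a SHIFT-bit two's-complement number.
    low = product & ((1 << SHIFT) - 1)
    if low >= 1 << (SHIFT - 1):
        low -= 1 << SHIFT              # sign-extend the low field

    # Removing the low field leaves a*c shifted left by SHIFT bits.
    high = (product - low) >> SHIFT
    return high, low


if __name__ == "__main__":
    print(packed_dual_multiply(3, -5, 7))
```

In hardware the subtraction of the low field corresponds to a small sign-correction step after the multiplier; whether the accelerator described here uses this scheme or a different one is not stated in the record.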