Double MAC on a DSP: Boosting the Performance of Convolutional Neural Networks on FPGAs

Bibliographic Details
Published in	IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems Vol. 38; no. 5; pp. 888-897
Main Authors	Lee, Sugil; Kim, Daewoo; Nguyen, Dong; Lee, Jongeun
Format	Journal Article
Language	English
Published	New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.05.2019
Subjects	Acceleration; Accelerator architectures; Accelerators; Artificial neural networks; Computation; Computer simulation; Convolution; Convolutional neural network (CNN); Digital signal processing; digital signal processing (DSP) block; Field programmable gate arrays; field-programmable gate array (FPGA); Gate arrays; Hardware; Machine learning; multiply-and-accumulate (MAC); Neural networks; reduced precision; single-instruction multiple-data (SIMD); Table lookup; Throughput; Workload

More Information
Summary:	Deep learning workloads, such as convolutional neural networks (CNNs), increasingly demand high-performance hardware acceleration. One distinguishing feature of a deep learning workload is that it is inherently resilient to small numerical errors and thus works very well with low-precision hardware. We propose a novel method called double multiply-and-accumulate (MAC) to theoretically double the computation rate of CNN accelerators by packing two MAC operations into one digital signal processing (DSP) block of off-the-shelf field-programmable gate arrays (FPGAs). We overcame several technical challenges by exploiting the mode of operation in the CNN accelerator. We have validated our method through FPGA synthesis and Verilog simulation, and evaluated it by applying it to a state-of-the-art CNN accelerator. The double MAC approach can double the computation throughput of a CNN layer. At the network level (all convolution layers combined), the performance improvement varies with the CNN application and FPGA size, from 14% to more than 80% over a highly optimized state-of-the-art accelerator solution, without significantly sacrificing output quality.
ISSN:	0278-0070, 1937-4151
DOI:	10.1109/TCAD.2018.2824280
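
The summary describes packing two MAC operations into one DSP block. The general operand-packing idea behind this can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the paper's circuit: the 8-bit signed operands, the 16-bit shift, and the names `double_multiply` and `double_mac` are all hypothetical choices made for this example. Two weights that share one input are placed in separate bit fields of a single wide operand, multiplied once, and the two products are then separated again.

```python
SHIFT = 16  # assumed field width; wide enough to hold one signed 8x8-bit product


def to_signed(value, bits):
    """Interpret the low `bits` bits of `value` as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >= 1 << (bits - 1) else value


def double_multiply(a, b, c):
    """Return (a*c, b*c) for signed 8-bit a, b, c using one wide multiplication."""
    packed = (a << SHIFT) + b          # a in the high field, b in the low field
    product = packed * c               # equals (a*c << SHIFT) + b*c
    low = to_signed(product, SHIFT)    # recover b*c, including its sign
    high = (product - low) >> SHIFT    # remove the low field, leaving a*c
    return high, low


def double_mac(weights_a, weights_b, inputs):
    """Accumulate two dot products that share the same input vector."""
    acc_a = acc_b = 0
    for a, b, c in zip(weights_a, weights_b, inputs):
        prod_a, prod_b = double_multiply(a, b, c)
        acc_a += prod_a
        acc_b += prod_b
    return acc_a, acc_b


print(double_multiply(3, -5, 7))            # (21, -35)
print(double_mac([1, 2], [3, 4], [5, 6]))   # (17, 39)
```

Subtracting the sign-extended low field before shifting is what keeps the high product correct when the low product is negative. For simplicity, this sketch splits each product before accumulating.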