Bandwidth-Efficient Sparse Matrix Multiplier Architecture for Deep Neural Networks on FPGA

Bibliographic Details
Published in: 2021 IEEE 34th International System-on-Chip Conference (SOCC), pp. 7-12
Main Authors: Mahesh M., Nalesh S., Kala S.
Format: Conference Proceeding
Language: English
Published: IEEE, 14.09.2021
Summary: Deep neural networks (DNNs) are promising solutions for many artificial intelligence and machine learning applications in fields such as safety and transportation, medicine, and weather forecasting. State-of-the-art deep neural networks can have hundreds of millions of parameters, which makes them less than ideal for mass adoption on devices with constrained memory and power budgets, such as edge computing and mobile devices. Techniques like quantization and inducing sparsity aim to reduce the total number of computations needed for deep learning inference. General-purpose computing hardware such as CPUs (Central Processing Units) and GPUs (Graphics Processing Units) is not optimized for portable embedded applications, as it is not energy efficient. Field Programmable Gate Arrays (FPGAs) are suitable candidates for edge computing, offering moderate power consumption together with flexibility. We propose an efficient sparse matrix-vector multiplication (SpMV) architecture that aims to make deep learning inference faster and more efficient while also reducing the memory-bandwidth bottleneck. The multiplication is handled by multiple multiply-and-accumulate (MAC) channels, and the architecture can exploit the maximum available memory bandwidth of the computing device. The proposed sparse matrix multiplier architecture has been implemented on a Zynq UltraScale+ FPGA at an operating frequency of 270 MHz and achieves a performance gain of up to 5× compared with existing implementations.
ISSN: 2164-1706
DOI: 10.1109/SOCC52499.2021.9739346
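
For readers unfamiliar with SpMV, the minimal C sketch below illustrates the computation summarized above: a compressed sparse row (CSR) matrix-vector product with rows interleaved across several MAC channels. The CSR layout, the channel count NUM_CHANNELS, and the round-robin row assignment are illustrative assumptions, not details taken from the paper, whose storage format and scheduling may differ.

```c
#include <stdio.h>

/* Illustrative sketch (assumptions, not the paper's design): CSR SpMV with
 * rows distributed round-robin over a hypothetical number of MAC channels. */
#define NUM_CHANNELS 4

void spmv_csr_channels(int rows,
                       const int *row_ptr,    /* size rows+1          */
                       const int *col_idx,    /* size nnz             */
                       const float *values,   /* size nnz             */
                       const float *x,        /* dense input vector   */
                       float *y)              /* dense output vector  */
{
    /* Each "channel" handles an interleaved subset of rows. */
    for (int ch = 0; ch < NUM_CHANNELS; ch++) {
        for (int r = ch; r < rows; r += NUM_CHANNELS) {
            float acc = 0.0f;                     /* MAC accumulator   */
            for (int k = row_ptr[r]; k < row_ptr[r + 1]; k++) {
                acc += values[k] * x[col_idx[k]]; /* multiply-accumulate */
            }
            y[r] = acc;
        }
    }
}

int main(void)
{
    /* Tiny 3x4 sparse matrix in CSR form. */
    int   row_ptr[] = {0, 2, 3, 5};
    int   col_idx[] = {0, 2, 1, 0, 3};
    float values[]  = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f};
    float x[]       = {1.0f, 1.0f, 1.0f, 1.0f};
    float y[3];

    spmv_csr_channels(3, row_ptr, col_idx, values, x, y);
    for (int r = 0; r < 3; r++)
        printf("y[%d] = %.1f\n", r, y[r]);
    return 0;
}
```

In hardware, each channel would stream its share of non-zero values and column indices from memory independently, which is how a multi-channel design of this kind can keep the full memory bandwidth of the device occupied.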