Parallelism-flexible Convolution Core for Sparse Convolutional Neural Networks on FPGA

Bibliographic Details
Published in: IPSJ Transactions on System LSI Design Methodology, Vol. 12, pp. 22-37
Main Authors: Sombatsiri, Salita; Shibata, Seiya; Kobayashi, Yuki; Inoue, Hiroaki; Takenaka, Takashi; Hosomi, Takeo; Yu, Jaehoon; Takeuchi, Yoshinori
Format: Journal Article
Language: English
Published: Tokyo: Information Processing Society of Japan, 01.01.2019
Japan Science and Technology Agency

Summary: The performance of recent CNN accelerators falls short of their peak performance because they fail to maximize parallel computation in every convolutional layer, as the available parallelism varies throughout the CNN. Furthermore, exploiting multiple types of parallelism may reduce the ability to skip calculations. This paper proposes a convolution core for sparse CNNs that efficiently leverages multiple types of parallelism and weight sparsity to achieve high performance. It adapts the dataflow and the scheduling of parallel computation to the available parallelism of each convolutional layer, exploiting both intra- and inter-output parallelism to maximize multiplier utilization. In addition, it eliminates the redundant multiply-accumulate (MACC) operations caused by weight sparsity. The proposed convolution core provides both capabilities with simple dataflow control by using a parallelism controller, which schedules parallel MACCs on the processing elements (PEs), and a weight broadcaster, which broadcasts non-zero weights to the PEs according to that schedule. The proposed convolution core was evaluated on the 13 convolutional layers of a sparse VGG-16 benchmark. It achieves a 4x speedup over a baseline dense-CNN architecture that exploits intra-output parallelism, and 3x the effective GMACS of prior CNN accelerators in total performance. (A minimal software sketch of the zero-skipping dataflow appears after the record details below.)
ISSN: 1882-6687
DOI: 10.2197/ipsjtsldm.12.22
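
Below is a minimal software sketch (Python/NumPy) of the zero-skipping, PE-parallel dataflow described in the summary above. It is an illustration only, not the authors' FPGA implementation: identifiers such as NUM_PES, broadcast_nonzero, and sparse_conv_layer are hypothetical names introduced here, and the round-robin assignment of output pixels to PEs stands in for the paper's parallelism controller as an assumed scheduling policy.

import numpy as np

NUM_PES = 4  # hypothetical number of processing elements (PEs)

def broadcast_nonzero(weights):
    # Models the weight broadcaster: only non-zero weights (with their
    # kernel indices) are streamed out, so zero-weight MACCs are skipped.
    for idx, w in enumerate(weights.flat):
        if w != 0.0:
            yield idx, w

def sparse_conv_layer(ifmap, weights):
    # Direct 2-D convolution (stride 1, no padding). Output pixels are
    # partitioned round-robin across PEs (inter-output parallelism); every
    # broadcast non-zero weight is applied by all PEs before the next one
    # is broadcast, mimicking a broadcast-and-accumulate schedule.
    kh, kw = weights.shape
    oh = ifmap.shape[0] - kh + 1
    ow = ifmap.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    positions = [(r, c) for r in range(oh) for c in range(ow)]
    for idx, w in broadcast_nonzero(weights):       # zero weights never reach a PE
        kr, kc = divmod(idx, kw)
        for pe in range(NUM_PES):
            for r, c in positions[pe::NUM_PES]:     # each PE owns its own output pixels
                out[r, c] += w * ifmap[r + kr, c + kc]  # one MACC per non-zero weight
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ifmap = rng.random((6, 6))
    weights = rng.random((3, 3))
    weights[weights < 0.5] = 0.0                    # induce weight sparsity
    # Reference dense convolution to check the sparse, PE-parallel model.
    ref = np.zeros((4, 4))
    for r in range(4):
        for c in range(4):
            ref[r, c] = np.sum(ifmap[r:r + 3, c:c + 3] * weights)
    assert np.allclose(sparse_conv_layer(ifmap, weights), ref)
    print("sparse PE-parallel result matches dense reference")

In hardware, the per-PE inner loop would run concurrently across the PE array within a single broadcast cycle; the Python loops merely serialize that parallelism so the zero-skipping arithmetic can be checked against a dense reference.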