Parallelism-flexible Convolution Core for Sparse Convolutional Neural Networks on FPGA

Bibliographic Details
Published in: IPSJ Transactions on System LSI Design Methodology, Vol. 12, pp. 22-37
Main Authors: Sombatsiri, Salita; Shibata, Seiya; Kobayashi, Yuki; Inoue, Hiroaki; Takenaka, Takashi; Hosomi, Takeo; Yu, Jaehoon; Takeuchi, Yoshinori
Format: Journal Article
Language: English
Published: Tokyo: Information Processing Society of Japan, 01.01.2019
Japan Science and Technology Agency

Summary: The performance of recent CNN accelerators falls short of their peak performance because they fail to maximize parallel computation in every convolutional layer, as the available parallelism varies throughout the CNN. Furthermore, exploiting multiple types of parallelism may reduce the ability to skip calculations. This paper proposes a convolution core for sparse CNNs that efficiently leverages multiple types of parallelism and weight sparsity to achieve high performance. It adapts the dataflow and the scheduling of parallel computation to the available parallelism of each convolutional layer, exploiting both intra- and inter-output parallelism to maximize multiplier utilization. In addition, it eliminates the redundant multiply-accumulate (MACC) operations caused by weight sparsity. The proposed convolution core provides both capabilities with simple dataflow control by using a parallelism controller, which schedules parallel MACCs on the processing elements (PEs), and a weight broadcaster, which broadcasts non-zero weights to the PEs according to that schedule. The proposed convolution core was evaluated on the 13 convolutional layers of a sparse VGG-16 benchmark. It achieves a 4x speedup over a baseline dense-CNN architecture that exploits intra-output parallelism, and 3x the effective GMACS of prior CNN accelerators in total performance. (A minimal software sketch of the zero-skipping dataflow appears after the record details below.)
ISSN: 1882-6687
DOI: 10.2197/ipsjtsldm.12.22
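
Below is a minimal software sketch (Python/NumPy) of the zero-skipping, PE-parallel dataflow described in the summary above. It is an illustration only, not the authors' FPGA implementation: identifiers such as NUM_PES, broadcast_nonzero, and sparse_conv_layer are hypothetical names introduced here, and the round-robin assignment of output pixels to PEs stands in for the paper's parallelism controller as an assumed scheduling policy.

import numpy as np

NUM_PES = 4  # hypothetical number of processing elements (PEs)

def broadcast_nonzero(weights):
    # Models the weight broadcaster: only non-zero weights (with their
    # kernel indices) are streamed out, so zero-weight MACCs are skipped.
    for idx, w in enumerate(weights.flat):
        if w != 0.0:
            yield idx, w

def sparse_conv_layer(ifmap, weights):
    # Direct 2-D convolution (stride 1, no padding). Output pixels are
    # partitioned round-robin across PEs (inter-output parallelism); every
    # broadcast non-zero weight is applied by all PEs before the next one
    # is broadcast, mimicking a broadcast-and-accumulate schedule.
    kh, kw = weights.shape
    oh = ifmap.shape[0] - kh + 1
    ow = ifmap.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    positions = [(r, c) for r in range(oh) for c in range(ow)]
    for idx, w in broadcast_nonzero(weights):       # zero weights never reach a PE
        kr, kc = divmod(idx, kw)
        for pe in range(NUM_PES):
            for r, c in positions[pe::NUM_PES]:     # each PE owns its own output pixels
                out[r, c] += w * ifmap[r + kr, c + kc]  # one MACC per non-zero weight
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ifmap = rng.random((6, 6))
    weights = rng.random((3, 3))
    weights[weights < 0.5] = 0.0                    # induce weight sparsity
    # Reference dense convolution to check the sparse, PE-parallel model.
    ref = np.zeros((4, 4))
    for r in range(4):
        for c in range(4):
            ref[r, c] = np.sum(ifmap[r:r + 3, c:c + 3] * weights)
    assert np.allclose(sparse_conv_layer(ifmap, weights), ref)
    print("sparse PE-parallel result matches dense reference")

In hardware, the per-PE inner loop would run concurrently across the PE array within a single broadcast cycle; the Python loops merely serialize that parallelism so the zero-skipping arithmetic can be checked against a dense reference.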