A CNN Accelerator on FPGA Using Depthwise Separable Convolution

Bibliographic Details
Published in	IEEE Transactions on Circuits and Systems II: Express Briefs, Vol. 65, no. 10, pp. 1415-1419
Main Authors	Bai, Lin; Zhao, Yiming; Huang, Xinming
Format	Journal Article
Language	English
Published	New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.10.2018
Subjects	Adders; Artificial neural networks; Bandwidth; Computation; Computer vision; Convolution; Convolutional neural network; Engines; Field programmable gate arrays; FPGA; Frames per second; Hardware accelerator; MobileNetV2; Model accuracy; Neural networks; Pattern recognition; Portable equipment; System-on-chip

Summary:	Convolutional neural networks (CNNs) have been widely deployed in the fields of computer vision and pattern recognition because of their high accuracy. However, large convolution operations are computationally intensive and often require a powerful computing platform such as a graphics processing unit, which makes it difficult to apply CNNs on portable devices. State-of-the-art CNNs for embedded platforms, such as MobileNetV2 and Xception, replace the standard convolution with depthwise separable convolution, which significantly reduces the number of operations and parameters with only a limited loss in accuracy. This highly structured model is well suited to field-programmable gate array (FPGA) implementation. In this brief, a scalable, high-performance CNN accelerator optimized for depthwise separable convolution is proposed. The accelerator can be fitted into FPGAs of different sizes by balancing hardware resources against processing speed. As an example, MobileNetV2 is implemented on an Arria 10 SoC FPGA, and the results show that this accelerator can classify each picture from ImageNet in 3.75 ms, which is about 266.6 frames per second. The FPGA design achieves a 20x speedup compared to a CPU.
ISSN:	1549-7747, 1558-3791
DOI:	10.1109/TCSII.2018.2865896
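
Illustration:	The Summary states that depthwise separable convolution reduces operations and parameters relative to standard convolution, and that a 3.75 ms latency corresponds to about 266.6 frames per second. The Python sketch below shows the usual cost accounting behind the first claim and the unit conversion behind the second. It is only an illustration and is not taken from the article: the function names, the layer shape in the usage note, and the decision to ignore bias terms, stride, and padding are all assumptions.

```python
def standard_conv_cost(k, c_in, c_out, h_out, w_out):
    """Return (parameters, multiply-accumulates) for a standard k x k convolution.

    Assumes no bias terms; h_out and w_out are the output feature-map size.
    """
    params = k * k * c_in * c_out
    macs = params * h_out * w_out
    return params, macs


def depthwise_separable_cost(k, c_in, c_out, h_out, w_out):
    """Return (parameters, multiply-accumulates) for a depthwise separable convolution.

    Depthwise stage: one k x k filter per input channel.
    Pointwise stage: a 1 x 1 convolution that mixes c_in channels into c_out.
    """
    depthwise_params = k * k * c_in
    pointwise_params = c_in * c_out
    params = depthwise_params + pointwise_params
    macs = params * h_out * w_out
    return params, macs


def frames_per_second(latency_ms):
    """Convert a per-image latency in milliseconds to a frame rate."""
    return 1000.0 / latency_ms
```

For a hypothetical layer with k = 3, 32 input channels, and 64 output channels, the ratio of separable to standard cost is 1/64 + 1/9, roughly 0.127, so the separable form needs about one eighth of the parameters and multiply-accumulates. Calling frames_per_second(3.75) gives a value consistent with the frame rate quoted in the Summary.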