Caffeine: Toward Uniformed Representation and Acceleration for Deep Convolutional Neural Networks

Bibliographic Details
Published in: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 38, No. 11, pp. 2072-2085
Main Authors: Zhang, Chen; Sun, Guangyu; Fang, Zhenman; Zhou, Peipei; Pan, Peichen; Cong, Jason
Format: Journal Article
Language: English
Published: New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.11.2019
Summary: With the recent advancement of multilayer convolutional neural networks (CNNs) and fully connected networks (FCNs), deep learning has achieved amazing success in many areas, especially in visual content understanding and classification. To improve the performance and energy efficiency of the computation-demanding CNN, FPGA-based acceleration has emerged as one of the most attractive alternatives. In this paper, we design and implement Caffeine, a hardware/software co-designed library to efficiently accelerate the entire CNN and FCN on FPGAs. First, we propose a uniformed convolutional matrix-multiplication representation for both computation-bound convolutional layers and communication-bound FCN layers. Based on this representation, we optimize the accelerator microarchitecture and maximize the underlying FPGA computing and bandwidth resource utilization based on a revised roofline model. Moreover, we design an automation flow to compile high-level network definitions directly to the final FPGA accelerator. As a case study, we integrate Caffeine into the industry-standard deep learning framework Caffe. We evaluate Caffeine and its Caffe integration by implementing the VGG16 and AlexNet networks on multiple FPGA platforms. Caffeine achieves a peak performance of 1460 giga fixed-point operations per second on a medium-sized Xilinx KU060 FPGA board; to our knowledge, this is the best published result. It achieves more than 100× speedup on FCN layers over prior FPGA accelerators. An end-to-end evaluation with the Caffe integration shows up to 29× performance and 150× energy gains over Caffe on a 12-core Xeon server, and 5.7× better energy efficiency than a GPU implementation. Performance projections for a system with a high-end FPGA (Virtex-7 690T) show even higher gains.
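The core idea named in the abstract, a uniformed convolutional matrix-multiplication representation for both layer types, can be illustrated with a minimal NumPy sketch. This is not the paper's Caffeine implementation; the function names and shapes below are illustrative assumptions. A convolutional layer is lowered to a matrix multiplication via im2col, and a fully connected layer is expressed as the same matrix multiplication with inputs batched so the weight matrix is reused, letting the communication-bound FCN layers share one compute kernel with the computation-bound convolutional layers.

import numpy as np

def im2col(x, kh, kw, stride=1):
    # Unfold kh x kw sliding windows of a (C, H, W) feature map into columns.
    c, h, w = x.shape
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    cols = np.empty((c * kh * kw, out_h * out_w), dtype=x.dtype)
    idx = 0
    for i in range(0, stride * out_h, stride):
        for j in range(0, stride * out_w, stride):
            cols[:, idx] = x[:, i:i + kh, j:j + kw].ravel()
            idx += 1
    return cols, out_h, out_w

def conv_as_matmul(x, weights, stride=1):
    # Convolutional layer lowered to a single matrix multiplication.
    m, c, kh, kw = weights.shape                # m output channels
    cols, out_h, out_w = im2col(x, kh, kw, stride)
    w_mat = weights.reshape(m, c * kh * kw)
    return (w_mat @ cols).reshape(m, out_h, out_w)

def fc_as_matmul(x_batch, weights):
    # Fully connected layer as the same matrix multiplication; batching the
    # inputs (columns) reuses the weight matrix, easing the bandwidth bound.
    return weights @ x_batch                    # (out, in) @ (in, batch)

# Illustrative usage with arbitrary shapes:
x = np.random.rand(3, 32, 32).astype(np.float32)
w_conv = np.random.rand(16, 3, 3, 3).astype(np.float32)
y = conv_as_matmul(x, w_conv)                   # (16, 30, 30)
w_fc = np.random.rand(10, 16 * 30 * 30).astype(np.float32)
batch = np.stack([y.ravel(), y.ravel()], axis=1)  # (in, batch=2)
scores = fc_as_matmul(batch, w_fc)              # (10, 2)

Both layer types reduce to the same dense matrix multiplication, which is the property the abstract exploits to drive one accelerator microarchitecture for the whole network.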
ISSN: 0278-0070, 1937-4151
DOI: 10.1109/TCAD.2017.2785257