A Memory-Efficient CNN Accelerator Using Segmented Logarithmic Quantization and Multi-Cluster Architecture
| Field | Value |
|---|---|
| Published in | IEEE Transactions on Circuits and Systems II: Express Briefs, Vol. 68, No. 6, pp. 2142–2146 |
| Main Authors | |
| Format | Journal Article |
| Language | English |
| Published | New York: IEEE (The Institute of Electrical and Electronics Engineers, Inc.), 01.06.2021 |
Summary: This brief presents a memory-efficient CNN accelerator design for resource-constrained devices in Internet of Things (IoT) and autonomous systems. A segmented logarithmic (SegLog) quantization method is exploited to reduce the on-chip memory and bandwidth requirements, thus accommodating more processing elements (PEs) in a given chip area to organize a reconfigurable multi-cluster architecture. The evaluation results show that SegLog quantization can achieve 6.4× model compression with less than 2.5% accuracy loss on various CNNs. An ASIC implementation with a 168-PE configuration is validated in a 40-nm CMOS process, with 2.54 TOPs/W energy efficiency and 0.8 mm² chip area reported. The accelerator has also been implemented on an FPGA with 1512 PEs and 468 kB of on-chip memory, achieving 1.29 GOPs/kB memory efficiency. Compared with state-of-the-art accelerators, the ASIC implementation improves area efficiency and arithmetic intensity by 1.94× and 5.62×, while the FPGA implementation improves memory efficiency by a factor of 2.34×.
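The summary does not spell out the SegLog algorithm; as background, plain logarithmic quantization rounds each weight magnitude to the nearest power of two, so multiplications reduce to shifts. A minimal illustrative sketch (the function name, bitwidth, and encoding are assumptions, not the paper's exact segmented scheme):

```python
import numpy as np

def log2_quantize(w, bitwidth=4):
    """Quantize weights to signed powers of two.

    A simple logarithmic quantizer illustrating the idea underlying
    SegLog; the paper's segmented variant (which mixes base-2 and finer
    segments) is not reproduced here.
    """
    sign = np.sign(w)
    # Round the log2 of each magnitude to the nearest integer exponent.
    exp = np.round(np.log2(np.abs(w) + 1e-12))
    # Clip exponents to the range representable with the given bitwidth
    # (one bit for the sign, the rest for the exponent).
    min_exp = -(2 ** (bitwidth - 1) - 1)
    exp = np.clip(exp, min_exp, 0)
    return sign * (2.0 ** exp)

w = np.array([0.6, -0.12, 0.03])
print(log2_quantize(w))  # -> [ 0.5     -0.125    0.03125]
```

Because every quantized weight is ±2^k, a hardware PE can replace each multiply with a barrel shift, which is what lets more PEs fit in the same chip area.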
ISSN: 1549-7747, 1558-3791
DOI: 10.1109/TCSII.2020.3038897