Reusing GEMM Hardware for Efficient Execution of Depthwise Separable Convolution on ASIC-Based DNN Accelerators

Bibliographic Details
Published in 2023 28th Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 475-482
Main Authors Manasi, Susmita Dey, Banerjee, Suvadeep, Davare, Abhijit, Sorokin, Anton A., Burns, Steven M., Kirkpatrick, Desmond A., Sapatnekar, Sachin S.
Format Conference Proceeding
Language English
Published New York, NY, USA: ACM, 16.01.2023
SeriesACM Conferences

Summary: Deep learning (DL) accelerators are optimized for standard convolution. However, lightweight convolutional neural networks (CNNs) use depthwise convolution (DwC) in key layers, and the structural difference between DwC and standard convolution leads to a significant performance bottleneck when lightweight CNNs are executed on such platforms. This work reuses the fast general matrix multiplication (GEMM) core of DL accelerators by mapping DwC to channel-wise parallel matrix-vector multiplications. An analytical framework is developed to guide pre-RTL hardware choices, and new hardware modules and software support are developed for end-to-end evaluation of the solution. This GEMM-based DwC execution strategy offers substantial performance gains for lightweight CNNs: a 7× speedup and 1.8× lower off-chip communication for MobileNet-v1 over a conventional DL accelerator, a 74× speedup over a CPU, and even a 1.4× speedup over a power-hungry GPU.
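To make the mapping concrete, the following is a minimal NumPy sketch (not the authors' implementation) of how DwC can be expressed as one independent matrix-vector product per channel, so each channel's work can be dispatched to a GEMM/GEMV datapath. The function name, the unpadded unit-stride layout, and the (C, H, W) tensor shapes are illustrative assumptions.

```python
import numpy as np

def depthwise_conv_as_gemv(x, w, stride=1):
    """Sketch: depthwise convolution as one matrix-vector
    multiplication per channel (im2col built per channel).

    x: input feature map, shape (C, H, W)
    w: depthwise filters, shape (C, K, K), one KxK filter per channel
    Returns output of shape (C, H_out, W_out). No padding is applied.
    """
    C, H, W = x.shape
    _, K, _ = w.shape
    H_out = (H - K) // stride + 1
    W_out = (W - K) // stride + 1
    y = np.empty((C, H_out, W_out), dtype=x.dtype)
    for c in range(C):  # channels are independent, hence parallel on hardware
        # Build the im2col matrix for channel c: each row is one KxK patch.
        patches = np.empty((H_out * W_out, K * K), dtype=x.dtype)
        for i in range(H_out):
            for j in range(W_out):
                hi, wj = i * stride, j * stride
                patches[i * W_out + j] = x[c, hi:hi + K, wj:wj + K].ravel()
        # One matrix-vector product per channel reuses the GEMM datapath.
        y[c] = (patches @ w[c].ravel()).reshape(H_out, W_out)
    return y

# Example: 8 channels, 16x16 maps, 3x3 depthwise filters.
x = np.random.randn(8, 16, 16).astype(np.float32)
w = np.random.randn(8, 3, 3).astype(np.float32)
y = depthwise_conv_as_gemv(x, w)  # shape (8, 14, 14)
```

Unlike standard convolution, no reduction occurs across channels, which is why each channel reduces to a GEMV rather than contributing columns to one large GEMM.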
ISBN:9781450397834
1450397832
ISSN:2153-697X
DOI:10.1145/3566097.3567863