A 7.3 M Output Non-Zeros/J, 11.7 M Output Non-Zeros/GB Reconfigurable Sparse Matrix-Matrix Multiplication Accelerator

Bibliographic Details
Published in: IEEE Journal of Solid-State Circuits, Vol. 55, No. 4, pp. 933–944
Main Authors: Park, Dong-Hyeon; Pal, Subhankar; Feng, Siying; Gao, Paul; Tan, Jielun; Rovinski, Austin; Xie, Shaolin; Zhao, Chun; Amarnath, Aporva; Wesley, Timothy; Beaumont, Jonathan; Chen, Kuan-Yu; Chakrabarti, Chaitali; Taylor, Michael Bedford; Mudge, Trevor; Blaauw, David; Kim, Hun-Seok; Dreslinski, Ronald G.
Format: Journal Article
Language: English
Published: New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.04.2020
Summary: A sparse matrix-matrix multiplication (SpMM) accelerator with 48 heterogeneous cores and a reconfigurable memory hierarchy is fabricated in 40-nm CMOS. The compute fabric consists of dedicated floating-point multiplication units and general-purpose Arm Cortex-M0 and Cortex-M4 cores. The on-chip memory can be reconfigured as scratchpad or cache, depending on the phase of the algorithm. The memory and compute units are interconnected by synthesizable coalescing crossbars for efficient memory access. The 2.0-mm × 2.6-mm chip achieves 12.6× (8.4×) higher energy efficiency, 11.7× (77.6×) higher off-chip bandwidth efficiency, and 17.1× (36.9×) higher compute density than a high-end CPU (GPU) across a diverse set of synthetic and real-world power-law graph-based sparse matrices.
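
For context on the phase-dependent memory reconfiguration the summary describes, SpMM is commonly accelerated with a two-phase outer-product formulation: a multiply phase that forms partial products from matching columns of A and rows of B, and a merge phase that reduces them into the output matrix. The Python sketch below illustrates that formulation only; the dictionary-based CSC/CSR encoding and the function name spmm_outer_product are expository assumptions, not the chip's actual data structures or implementation.

```python
# Illustrative two-phase outer-product SpMM sketch (assumptions noted above).
from collections import defaultdict

def spmm_outer_product(A_csc, B_csr):
    """Compute C = A @ B for sparse A and B.

    A_csc: {col k: [(row i, value), ...]}  -- column-major view of A
    B_csr: {row k: [(col j, value), ...]}  -- row-major view of B
    Returns {row i: {col j: value}} holding only the non-zeros of C.
    """
    # Multiply phase: pair column k of A with row k of B, emitting
    # partial products grouped by output row.
    partials = defaultdict(list)
    for k in A_csc.keys() & B_csr.keys():
        for i, a in A_csc[k]:
            for j, b in B_csr[k]:
                partials[i].append((j, a * b))

    # Merge phase: reduce partial products that share an output
    # coordinate. The two phases have very different memory access
    # patterns, which is what phase-dependent scratchpad/cache
    # reconfiguration (as in the summary) targets.
    C = {}
    for i, entries in partials.items():
        row = defaultdict(float)
        for j, v in entries:
            row[j] += v
        C[i] = dict(row)
    return C

# Tiny example: A = [[1, 0], [0, 2]], B = [[0, 3], [4, 0]]
A_csc = {0: [(0, 1.0)], 1: [(1, 2.0)]}
B_csr = {0: [(1, 3.0)], 1: [(0, 4.0)]}
print(spmm_outer_product(A_csc, B_csr))  # -> {0: {1: 3.0}, 1: {0: 8.0}}
```

Note how the multiply phase streams through its inputs while the merge phase reuses intermediate data irregularly, which is one plausible motivation for reconfiguring on-chip memory between scratchpad and cache across phases.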
ISSN: 0018-9200
EISSN: 1558-173X
DOI: 10.1109/JSSC.2019.2960480