A 7.3 M Output Non-Zeros/J, 11.7 M Output Non-Zeros/GB Reconfigurable Sparse Matrix-Matrix Multiplication Accelerator

Bibliographic Details
Published in: IEEE Journal of Solid-State Circuits, Vol. 55, No. 4, pp. 933–944
Main Authors: Park, Dong-Hyeon; Pal, Subhankar; Feng, Siying; Gao, Paul; Tan, Jielun; Rovinski, Austin; Xie, Shaolin; Zhao, Chun; Amarnath, Aporva; Wesley, Timothy; Beaumont, Jonathan; Chen, Kuan-Yu; Chakrabarti, Chaitali; Taylor, Michael Bedford; Mudge, Trevor; Blaauw, David; Kim, Hun-Seok; Dreslinski, Ronald G.
Format: Journal Article
Language: English
Published: New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.04.2020
Summary: A sparse matrix-matrix multiplication (SpMM) accelerator with 48 heterogeneous cores and a reconfigurable memory hierarchy is fabricated in 40-nm CMOS. The compute fabric consists of dedicated floating-point multiplication units and general-purpose Arm Cortex-M0 and Cortex-M4 cores. The on-chip memory can be reconfigured as scratchpad or cache, depending on the phase of the algorithm. The memory and compute units are interconnected by synthesizable coalescing crossbars for efficient memory access. The 2.0-mm × 2.6-mm chip achieves 12.6× (8.4×) higher energy efficiency, 11.7× (77.6×) higher off-chip bandwidth efficiency, and 17.1× (36.9×) higher compute density than a high-end CPU (GPU) across a diverse set of synthetic and real-world power-law graph-based sparse matrices.
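
For context on the phase-dependent memory reconfiguration the summary describes, SpMM is commonly accelerated with a two-phase outer-product formulation: a multiply phase that forms partial products from matching columns of A and rows of B, and a merge phase that reduces them into the output matrix. The Python sketch below illustrates that formulation only; the dictionary-based CSC/CSR encoding and the function name spmm_outer_product are expository assumptions, not the chip's actual data structures or implementation.

```python
# Illustrative two-phase outer-product SpMM sketch (assumptions noted above).
from collections import defaultdict

def spmm_outer_product(A_csc, B_csr):
    """Compute C = A @ B for sparse A and B.

    A_csc: {col k: [(row i, value), ...]}  -- column-major view of A
    B_csr: {row k: [(col j, value), ...]}  -- row-major view of B
    Returns {row i: {col j: value}} holding only the non-zeros of C.
    """
    # Multiply phase: pair column k of A with row k of B, emitting
    # partial products grouped by output row.
    partials = defaultdict(list)
    for k in A_csc.keys() & B_csr.keys():
        for i, a in A_csc[k]:
            for j, b in B_csr[k]:
                partials[i].append((j, a * b))

    # Merge phase: reduce partial products that share an output
    # coordinate. The two phases have very different memory access
    # patterns, which is what phase-dependent scratchpad/cache
    # reconfiguration (as in the summary) targets.
    C = {}
    for i, entries in partials.items():
        row = defaultdict(float)
        for j, v in entries:
            row[j] += v
        C[i] = dict(row)
    return C

# Tiny example: A = [[1, 0], [0, 2]], B = [[0, 3], [4, 0]]
A_csc = {0: [(0, 1.0)], 1: [(1, 2.0)]}
B_csr = {0: [(1, 3.0)], 1: [(0, 4.0)]}
print(spmm_outer_product(A_csc, B_csr))  # -> {0: {1: 3.0}, 1: {0: 8.0}}
```

Note how the multiply phase streams through its inputs while the merge phase reuses intermediate data irregularly, which is one plausible motivation for reconfiguring on-chip memory between scratchpad and cache across phases.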
ISSN: 0018-9200
EISSN: 1558-173X
DOI: 10.1109/JSSC.2019.2960480