OuterSPACE: An Outer Product Based Sparse Matrix Multiplication Accelerator

Sparse matrices are widely used in graph and data analytics, machine learning, engineering and scientific applications. This paper describes and analyzes OuterSPACE, an accelerator targeted at applications that involve large sparse matrices. OuterSPACE is a highly-scalable, energy-efficient, reconfi...

Full description

Saved in:
Bibliographic Details
Published in2018 IEEE International Symposium on High Performance Computer Architecture (HPCA) pp. 724 - 736
Main Authors Pal, Subhankar, Beaumont, Jonathan, Park, Dong-Hyeon, Amarnath, Aporva, Feng, Siying, Chakrabarti, Chaitali, Kim, Hun-Seok, Blaauw, David, Mudge, Trevor, Dreslinski, Ronald
Format Conference Proceeding
LanguageEnglish
Published IEEE 01.02.2018
Subjects
Online AccessGet full text

Cover

Loading…
More Information
Summary:Sparse matrices are widely used in graph and data analytics, machine learning, engineering and scientific applications. This paper describes and analyzes OuterSPACE, an accelerator targeted at applications that involve large sparse matrices. OuterSPACE is a highly-scalable, energy-efficient, reconfigurable design, consisting of massively parallel Single Program, Multiple Data (SPMD)-style processing units, distributed memories, high-speed crossbars and High Bandwidth Memory (HBM). We identify redundant memory accesses to non-zeros as a key bottleneck in traditional sparse matrix-matrix multiplication algorithms. To ameliorate this, we implement an outer product based matrix multiplication technique that eliminates redundant accesses by decoupling multiplication from accumulation. We demonstrate that traditional architectures, due to limitations in their memory hierarchies and ability to harness parallelism in the algorithm, are unable to take advantage of this reduction without incurring significant overheads. OuterSPACE is designed to specifically overcome these challenges. We simulate the key components of our architecture using gem5 on a diverse set of matrices from the University of Florida's SuiteSparse collection and the Stanford Network Analysis Project and show a mean speedup of 7.9× over Intel Math Kernel Library on a Xeon CPU, 13.0× against cuSPARSE and 14.0× against CUSP when run on an NVIDIA K40 GPU, while achieving an average throughput of 2.9 GFLOPS within a 24 W power budget in an area of 87 mm2.
ISSN:2378-203X
DOI:10.1109/HPCA.2018.00067