Fast Sparse Matrix-Vector Multiplication by Exploiting Variable Block Structure

Bibliographic Details
Published in	High Performance Computing and Communications, pp. 807-816
Main Authors	Vuduc, Richard W.; Moon, Hyun-Jin
Format	Book Chapter; Conference Proceeding
Language	English
Published	Berlin, Heidelberg: Springer Berlin Heidelberg, 2005
Edition	1st ed.
Series	Lecture Notes in Computer Science
Subjects	Applied sciences | Block Size | Cache Blocking | Compression Ratio | Computer science; control theory; systems | Computer systems and distributed systems. User interface | Dense Block | Exact sciences and technology | Software | Sparse Matrix | Finite element method | High performance | Alignment | Cache memory | Bandwidth | Matrix calculus | Tiling | Data structure | Distributed computing | Modeling | Execution time
Summary:	We improve the performance of sparse matrix-vector multiplication (SpMV) on modern cache-based superscalar machines when the matrix structure consists of multiple, irregularly aligned rectangular blocks. Matrices from finite element modeling applications often have this structure. We split the matrix, A, into a sum, A1 + A2 + ... + As, where each term is stored in a new data structure we refer to as unaligned block compressed sparse row (UBCSR) format. A classical approach that stores A in block compressed sparse row (BCSR) format can also reduce execution time, but the improvements may be limited because BCSR imposes an alignment on the matrix non-zeros that leads to extra work from filled-in zeros. Combining splitting with UBCSR reduces this extra work while retaining the generally lower memory bandwidth requirements and register-level tiling opportunities of BCSR. On a set of application matrices, we show speedups as high as 2.1× over no blocking and as high as 1.8× over BCSR as used in prior work. Even when performance does not improve significantly, split UBCSR usually reduces matrix storage.
ISBN:	9783540290315 3540290311
ISSN:	0302-9743 1611-3349
DOI:	10.1007/11557654_91
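
Illustration	The summary attributes BCSR's extra work to filled-in zeros caused by forced block alignment. The sketch below is not the authors' code and does not reproduce the UBCSR layout, which this record does not describe. It is a minimal Python illustration, under the assumption that aligned r × c blocking stores every block of the aligned grid that contains at least one non-zero. The function name `aligned_block_fill` and the sample coordinates are hypothetical.

```python
def aligned_block_fill(nonzeros, r, c):
    """Estimate storage for aligned r x c blocking (illustrative assumption).

    nonzeros: iterable of (row, col) positions of true non-zeros.
    Returns (stored_values, true_nonzeros, fill_ratio), where stored_values
    counts every slot of every aligned block touched by a non-zero,
    including explicitly stored (filled-in) zeros.
    """
    nz = set(nonzeros)
    # An aligned block is identified by its grid coordinates: rows start at
    # multiples of r, columns at multiples of c.
    touched_blocks = {(i // r, j // c) for (i, j) in nz}
    stored = len(touched_blocks) * r * c
    return stored, len(nz), stored / len(nz)


# One dense 2 x 2 block whose top-left corner sits at (1, 1), so it
# straddles four blocks of the aligned 2 x 2 grid.
unaligned = [(1, 1), (1, 2), (2, 1), (2, 2)]
stored_u, nnz_u, fill_u = aligned_block_fill(unaligned, 2, 2)

# The same dense block shifted to (0, 0), where it matches the grid.
aligned = [(0, 0), (0, 1), (1, 0), (1, 1)]
stored_a, nnz_a, fill_a = aligned_block_fill(aligned, 2, 2)

print(stored_u, nnz_u, fill_u)
print(stored_a, nnz_a, fill_a)
```

In this toy case the misaligned block forces aligned blocking to store four times as many values as there are true non-zeros, whereas a format free to start the block at row 1, column 1 would need to store only the four non-zeros. That gap is the kind of extra work the summary says split UBCSR reduces.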