Engineering Boolean Matrix Multiplication for Multiple-Accelerator Shared-Memory Architectures

Bibliographic Details
Published in	arXiv.org
Main Authors	Karppa, Matti; Kaski, Petteri
Format	Paper
Language	English
Published	Ithaca: Cornell University Library, arXiv.org, 04.09.2019
Subjects	Accelerators; Algorithms; Boolean algebra; Computer architecture; Computer memory; Hardware; Matrices (mathematics); Multiplication; Power consumption
Summary:	We study the problem of multiplying two bit matrices with entries either over the Boolean algebra \((0,1,\vee,\wedge)\) or over the binary field \((0,1,+,\cdot)\). We engineer high-performance open-source algorithm implementations for contemporary multiple-accelerator shared-memory architectures, with the objective of time-and-energy-efficient scaling up to input sizes close to the available shared memory capacity. For example, given two terabinary-bit square matrices as input, our implementations compute the Boolean product in approximately 2100 seconds (1.0 Pbop/s at 3.3 pJ/bop for a total of 2.1 kWh/product) and the binary product in less than 950 seconds (2.4 effective Pbop/s at 1.5 effective pJ/bop for a total of 0.92 kWh/product) on an NVIDIA DGX-1 with power consumption at peak system power (3.5 kW). Our contributions are (a) for the binary product, the use of alternative-basis techniques of Karstadt and Schwartz [SPAA '17] to design novel alternative-basis variants of Strassen's recurrence for \(2\times 2\) block multiplication [Numer. Math. 13 (1969)] that have been optimized for both the number of additions and low working memory; (b) structuring the parallel block recurrences and the memory layout for coalescent and register-localized execution on accelerator hardware; (c) low-level engineering of the innermost block products for the specific target hardware; and (d) structuring the top-level shared-memory implementation to feed the accelerators with data and integrate the results for input and output sizes beyond the aggregate memory capacity of the available accelerators.
ISSN:	2331-8422
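
The two products named in the summary differ only in how partial results are accumulated: OR for the Boolean algebra, XOR (addition modulo 2) for the binary field. The following is a minimal illustrative sketch of that distinction, not the authors' implementation, which targets multiple GPUs and uses Strassen-type block recurrences for the binary product. The function name `bit_matmul` and the row-as-integer bit packing (bit j of an integer holds column j) are assumptions made for this example only.

```python
from typing import List


def bit_matmul(a_rows: List[int], b_rows: List[int], boolean: bool) -> List[int]:
    """Multiply two bit matrices stored as one Python int per row.

    Bit j of a row integer is the entry in column j (an assumed packing).
    With boolean=True the product is over (0,1,OR,AND); otherwise it is
    over the binary field (0,1,XOR,AND).
    """
    out = []
    for a in a_rows:
        acc = 0
        k = 0
        while a:
            if a & 1:
                # Entry (i, k) of A is 1, so row k of B contributes
                # to output row i: OR for Boolean, XOR for binary.
                acc = (acc | b_rows[k]) if boolean else (acc ^ b_rows[k])
            a >>= 1
            k += 1
        out.append(acc)
    return out


# A = [[1, 1], [0, 1]] and B = [[1, 0], [1, 1]] in the assumed packing.
A = [0b11, 0b10]
B = [0b01, 0b11]

boolean_product = bit_matmul(A, B, boolean=True)   # rows [1, 1] and [1, 1]
binary_product = bit_matmul(A, B, boolean=False)   # rows [0, 1] and [1, 1]
```

In this example the two products differ in entry (0, 0): the integer sum there is 2, which is 1 under OR but 0 modulo 2. Because the binary field has subtraction (cancellation), Strassen-type recurrences apply to the binary product directly, whereas the Boolean algebra offers no such cancellation.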