Engineering Boolean Matrix Multiplication for Multiple-Accelerator Shared-Memory Architectures

Bibliographic Details
Published in arXiv.org
Main Authors Karppa, Matti; Kaski, Petteri
Format Paper
Language English
Published Ithaca: Cornell University Library, arXiv.org, 04.09.2019
Summary: We study the problem of multiplying two bit matrices with entries either over the Boolean algebra \((0,1,\vee,\wedge)\) or over the binary field \((0,1,+,\cdot)\). We engineer high-performance open-source algorithm implementations for contemporary multiple-accelerator shared-memory architectures, with the objective of time-and-energy-efficient scaling up to input sizes close to the available shared memory capacity. For example, given two terabinary-bit square matrices as input, our implementations compute the Boolean product in approximately 2100 seconds (1.0 Pbop/s at 3.3 pJ/bop for a total of 2.1 kWh/product) and the binary product in less than 950 seconds (2.4 effective Pbop/s at 1.5 effective pJ/bop for a total of 0.92 kWh/product) on an NVIDIA DGX-1, with power consumption taken at peak system power (3.5 kW). Our contributions are (a) for the binary product, the use of alternative-basis techniques of Karstadt and Schwartz [SPAA '17] to design novel alternative-basis variants of Strassen's recurrence for \(2\times 2\) block multiplication [Numer. Math. 13 (1969)] that have been optimized for both the number of additions and low working memory, (b) structuring the parallel block recurrences and the memory layout for coalescent and register-localized execution on accelerator hardware, (c) low-level engineering of the innermost block products for the specific target hardware, and (d) structuring the top-level shared-memory implementation to feed the accelerators with data and integrate the results for input and output sizes beyond the aggregate memory capacity of the available accelerators.
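To make the two products named in the summary concrete, the following is a minimal Python sketch of their definitions on bit-packed rows. It is not the authors' implementation, which targets accelerators and uses Strassen-type recurrences; the function names `transpose`, `boolean_product`, and `binary_product` and the row-as-integer layout (bit j of a row holds column j) are assumptions made only for illustration.

```python
from typing import List


def transpose(rows: List[int], n: int) -> List[int]:
    """Transpose an n x n bit matrix stored as one integer per row.

    Assumed layout: bit j of rows[i] is the entry in row i, column j.
    """
    cols = [0] * n
    for i, r in enumerate(rows):
        for j in range(n):
            if (r >> j) & 1:
                cols[j] |= 1 << i
    return cols


def boolean_product(a: List[int], b: List[int], n: int) -> List[int]:
    """Product over the Boolean algebra: C[i][j] = OR_k (A[i][k] AND B[k][j])."""
    bt = transpose(b, n)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            # The OR of the bitwise ANDs is 1 exactly when the AND word is nonzero.
            if a[i] & bt[j]:
                c[i] |= 1 << j
    return c


def binary_product(a: List[int], b: List[int], n: int) -> List[int]:
    """Product over the binary field: C[i][j] = XOR_k (A[i][k] AND B[k][j])."""
    bt = transpose(b, n)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            # Addition modulo 2 is the parity of the number of set bits.
            if bin(a[i] & bt[j]).count("1") & 1:
                c[i] |= 1 << j
    return c


# Example: A = [[1, 1], [0, 1]] and B = [[1, 0], [1, 1]] in the assumed layout.
A = [0b11, 0b10]
B = [0b01, 0b11]
print(boolean_product(A, B, 2))
print(binary_product(A, B, 2))
```

The two functions differ only in how the word-wide AND is reduced: a nonzero test for the Boolean algebra and a parity for the binary field. Only the binary field has subtraction, which is why the summary applies the Strassen-type recurrences to the binary product.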
ISSN:2331-8422