MAD MAcce: Supporting Multiply-Add Operations for Democratizing Matrix-Multiplication Accelerators

Bibliographic Details
Published in	2023 56th IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 367-379
Main Authors	Sung, Seunghwan; Hur, Sujin; Kim, Sungwoo; Ha, Dongho; Oh, Yunho; Ro, Won Woo
Format	Conference Proceeding
Language	English
Published	ACM 28.10.2023
Subjects	Codes; Computer architecture; GPU; Hardware; High Performance Computing; Organizations; Programming; Tensor Cores; Tensors; Throughput
Summary:	Modern GPUs commonly employ specialized matrix multiplication units (MXUs) to accelerate matrix multiplication, the core computation of deep learning workloads. However, it is challenging to exploit the MXUs for GPGPU applications whose fundamental algorithms do not rely on matrix multiplication. Furthermore, additional programming effort is necessary to tailor existing code or algorithms to MXUs using dedicated APIs or libraries. Therefore, MXUs are often underutilized even when GPUs hunger for higher throughput. We observe that intensive multiply-and-add (MAD) instructions often become bottlenecks in compute-intensive applications. Furthermore, when such MAD instructions have data dependencies, they create computations similar to the dot-product operations of MXUs. By leveraging these observations, we propose a novel MXU architecture called MAD MAcce that can handle both matrix multiplication and MAD operations. In our design, the GPU compiler detects target MAD instructions by analyzing the instruction stream and generates new instructions for MAD MAcce in a programmer-transparent manner. Then, MAD MAcce executes the newly generated instructions. By offloading MAD operations to the MXUs, GPUs can exploit the high throughput of MXUs for various domains without significant hardware modification or additional programming effort. In our evaluation, MAD MAcce achieves up to 2.13× speedup and 1.65× average speedup in compute-intensive applications. CCS Concepts: • Computer systems organization → Single instruction multiple data.
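
The summary's observation that data-dependent MAD instructions resemble a dot product can be illustrated with a minimal sketch. This is not the paper's compiler pass or hardware design. The function names (`mad`, `dependent_mad_chain`, `dot_product`) and the sample operands are hypothetical, chosen only to show that a chain of MADs, each feeding its result into the next as the addend, accumulates the same value as a dot product over the same operands.

```python
def mad(a, b, c):
    """One multiply-and-add operation: returns a * b + c."""
    return a * b + c


def dependent_mad_chain(xs, ys, acc=0):
    """Run a chain of MADs where each result is the addend of the next.

    This models the data dependency mentioned in the summary; it is an
    illustrative assumption, not the instruction pattern the paper's
    compiler actually detects.
    """
    for x, y in zip(xs, ys):
        acc = mad(x, y, acc)
    return acc


def dot_product(xs, ys):
    """Reference dot product, the kind of operation an MXU computes."""
    return sum(x * y for x, y in zip(xs, ys))


# Hypothetical integer operands, so the comparison is exact.
xs = [1, 2, 3, 4]
ys = [5, 6, 7, 8]

chain_result = dependent_mad_chain(xs, ys)
dot_result = dot_product(xs, ys)
print(chain_result, dot_result)
```

With integer operands the two results match exactly. With floating-point operands they could differ slightly if the hardware accumulates in a different order or precision, which this sketch does not model.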