Balanced Data Placement for GEMV Acceleration with Processing-In-Memory

Bibliographic Details
Published in	arXiv.org
Main Authors	Mohamed Assem Ibrahim, Mahzabeen Islam, Shaizeen Aga
Format	Paper; Journal Article
Language	English
Published	Ithaca: Cornell University Library, arXiv.org, 01.04.2024
Subjects	Computer Science - Distributed, Parallel, and Cluster Computing; Computer Science - Hardware Architecture; Generative artificial intelligence; Knobs; Mathematical analysis; Matrices (mathematics); Microprocessors; Placement
Summary:	With unprecedented demand for generative AI (GenAI) inference, acceleration of the primitives that dominate GenAI, such as general matrix-vector multiplication (GEMV), is receiving considerable attention. A challenge with GEMVs is the high memory bandwidth this primitive demands. Multiple memory vendors have proposed commercially viable processing-in-memory (PIM) prototypes that attain a bandwidth boost over the processor by augmenting memory banks with compute capabilities and broadcasting the same command to all banks. While the proposed PIM designs stand to accelerate GEMV, we observe in this work that a key impediment to truly harnessing PIM acceleration is deducing the optimal data placement for the matrix in memory banks. To this end, we tease out several factors that impact data placement and propose the PIMnast methodology which, like a gymnast, balances these factors to identify data placements that deliver GEMV acceleration. Across a spectrum of GenAI models, the proposed PIMnast methodology, along with additional orchestration knobs we identify, delivers up to 6.86\(\times\) speedup for GEMVs (of the available 7\(\times\) roofline speedup), leading to up to 5\(\times\) speedup in per-token latency.
ISSN:	2331-8422
DOI:	10.48550/arxiv.2403.20297