SparseTrim: A Neural Network Accelerator Featuring On-Chip Decompression of Fine-Grained Sparse Model with 10.1TOPS/W System Energy Efficiency

Bibliographic Details
Published in: Proceedings of the Custom Integrated Circuits Conference, pp. 1-3
Main Authors: Li, Jieyu; He, Weifeng; Jiang, Boran; Wang, Xinyu; He, Guanghui; Liu, Dingxuan; Seok, Mingoo
Format: Conference Proceeding
Language: English
Published: IEEE, 13.04.2025
Summary: The large off-chip memory access cost, coupled with ever-increasing model sizes, poses a major challenge to developing energy-efficient neural-network (NN) accelerators [1]. An effective approach is fine-grained model pruning followed by sparse weight compression to reduce the memory footprint [2]-[6]. This allows accelerators to load the compressed sparse weights from off-chip memory and decompress them just before computation, greatly reducing the off-chip weight-loading cost. Prior sparse weight compression methods, however, suffer from either a low compression ratio or slow sequential decompression: the former limits the reduction of off-chip weight-loading cost, and the latter is unsuited to the massively parallel computation of NN accelerators.
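The abstract does not describe the paper's compression format itself. The sketch below is a minimal NumPy illustration of one common baseline for fine-grained sparse weight storage, a bitmask scheme; the function names, the tile size, and the 75% sparsity figure are illustrative assumptions, not details from the paper. It shows the trade-off the abstract alludes to: each output position can be reconstructed independently of its neighbors, which makes massively parallel on-chip decompression feasible, in contrast to sequential formats such as run-length coding.

```python
import numpy as np

def compress_bitmask(weights: np.ndarray):
    """Compress a fine-grained sparse weight tile into a
    1-bit-per-weight mask plus a dense array of nonzero values."""
    mask = weights != 0                    # 1 bit of metadata per weight
    values = weights[mask]                 # store only the surviving weights
    return np.packbits(mask.ravel()), values, weights.shape

def decompress_bitmask(packed_mask, values, shape):
    """Rebuild the dense tile from the bitmask and value stream.
    Every output position depends only on the mask, so hardware can
    expand many weights in parallel (unlike sequential decoders)."""
    count = int(np.prod(shape))
    mask = np.unpackbits(packed_mask, count=count).astype(bool)
    dense = np.zeros(count, dtype=values.dtype)
    dense[mask] = values
    return dense.reshape(shape)

# Example: a 75%-sparse 4x8 weight tile (hypothetical numbers)
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
w[rng.random((4, 8)) < 0.75] = 0.0

packed, vals, shape = compress_bitmask(w)
assert np.array_equal(decompress_bitmask(packed, vals, shape), w)

# Storage: 1 bit/weight of mask + 32 bits per nonzero weight,
# vs. 32 bits/weight for the dense tile.
print(f"dense: {w.size * 32} bits, "
      f"compressed: {packed.size * 8 + vals.size * 32} bits")
```

A bitmask caps the compression ratio at roughly 1 bit of metadata per weight, which is why schemes targeting higher ratios (and the parallel decompressors to match) remain an active design point, as this paper's on-chip decompression approach suggests.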
ISSN: 2152-3630
DOI: 10.1109/CICC63670.2025.10982861