SparseTrim: A Neural Network Accelerator Featuring On-Chip Decompression of Fine-Grained Sparse Model with 10.1TOPS/W System Energy Efficiency

Bibliographic Details
Published in: Proceedings of the Custom Integrated Circuits Conference, pp. 1-3
Main Authors: Li, Jieyu; He, Weifeng; Jiang, Boran; Wang, Xinyu; He, Guanghui; Liu, Dingxuan; Seok, Mingoo
Format: Conference Proceeding
Language: English
Published: IEEE, 13.04.2025
Summary: The large off-chip memory access cost, coupled with ever-increasing model sizes, poses a major challenge to developing energy-efficient neural-network (NN) accelerators [1]. An effective approach is fine-grained model pruning followed by sparse weight compression to reduce the memory footprint [2]-[6]. This allows accelerators to load the compressed sparse weights from off-chip memory and decompress them just before computation, greatly reducing the off-chip weight-loading cost. Prior sparse weight compression methods, however, suffer from either a low compression ratio or slow sequential decompression: the former limits the reduction of off-chip weight-loading cost, and the latter is unsuited to the massively parallel computation of NN accelerators.
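The abstract does not describe the paper's compression format itself. The sketch below is a minimal NumPy illustration of one common baseline for fine-grained sparse weight storage, a bitmask scheme; the function names, the tile size, and the 75% sparsity figure are illustrative assumptions, not details from the paper. It shows the trade-off the abstract alludes to: each output position can be reconstructed independently of its neighbors, which makes massively parallel on-chip decompression feasible, in contrast to sequential formats such as run-length coding.

```python
import numpy as np

def compress_bitmask(weights: np.ndarray):
    """Compress a fine-grained sparse weight tile into a
    1-bit-per-weight mask plus a dense array of nonzero values."""
    mask = weights != 0                    # 1 bit of metadata per weight
    values = weights[mask]                 # store only the surviving weights
    return np.packbits(mask.ravel()), values, weights.shape

def decompress_bitmask(packed_mask, values, shape):
    """Rebuild the dense tile from the bitmask and value stream.
    Every output position depends only on the mask, so hardware can
    expand many weights in parallel (unlike sequential decoders)."""
    count = int(np.prod(shape))
    mask = np.unpackbits(packed_mask, count=count).astype(bool)
    dense = np.zeros(count, dtype=values.dtype)
    dense[mask] = values
    return dense.reshape(shape)

# Example: a 75%-sparse 4x8 weight tile (hypothetical numbers)
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
w[rng.random((4, 8)) < 0.75] = 0.0

packed, vals, shape = compress_bitmask(w)
assert np.array_equal(decompress_bitmask(packed, vals, shape), w)

# Storage: 1 bit/weight of mask + 32 bits per nonzero weight,
# vs. 32 bits/weight for the dense tile.
print(f"dense: {w.size * 32} bits, "
      f"compressed: {packed.size * 8 + vals.size * 32} bits")
```

A bitmask caps the compression ratio at roughly 1 bit of metadata per weight, which is why schemes targeting higher ratios (and the parallel decompressors to match) remain an active design point, as this paper's on-chip decompression approach suggests.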
ISSN: 2152-3630
DOI: 10.1109/CICC63670.2025.10982861