SparseTrim: A Neural Network Accelerator Featuring On-Chip Decompression of Fine-Grained Sparse Model with 10.1TOPS/W System Energy Efficiency
Published in | Proceedings of the Custom Integrated Circuits Conference, pp. 1-3
Format | Conference Proceeding
Language | English
Published | IEEE, 13.04.2025
Summary | The large off-chip memory access cost, coupled with ever-increasing model size, has posed a major challenge to developing energy-efficient neural-network (NN) accelerators [1]. An effective approach is fine-grained model pruning followed by sparse weight compression to reduce the memory footprint [2]-[6]. This allows accelerators to load the compressed sparse weights from off-chip memory and decompress them just before computation, greatly reducing the off-chip weight-loading cost. Prior sparse-weight compression methods, however, suffer from either a low compression ratio or slow sequential decompression: the former limits the reduction of the off-chip weight-loading cost, and the latter is ill-suited to the massively parallel computation of NN accelerators.
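This record does not detail SparseTrim's actual compression format, but the tradeoff the summary describes can be made concrete with a minimal sketch. The Python below is illustrative only: the bitmask format, function names, and the ~75% sparsity level are assumptions, not the paper's scheme. It compresses a fine-grained sparse weight tensor into a dense array of nonzero values plus a one-bit-per-weight presence mask; because each output element depends only on the mask, blocks of the tensor can be decompressed independently and in parallel, whereas higher-ratio entropy coders (e.g., Huffman) must decode symbols sequentially.

```python
import numpy as np

def compress_bitmask(weights: np.ndarray):
    """Compress a fine-grained sparse tensor into (bitmask, nonzero values).
    Illustrative sketch only -- not SparseTrim's actual format."""
    flat = weights.ravel()
    mask = flat != 0          # one presence bit per weight (bool here; packed in hardware)
    values = flat[mask]       # nonzero weights stored densely
    return mask, values, weights.shape

def decompress_bitmask(mask, values, shape):
    """Rebuild the dense tensor. Any block of the output can be expanded
    independently given a prefix count of set mask bits, so this style of
    decompression parallelizes, unlike sequential entropy decoding."""
    flat = np.zeros(mask.size, dtype=values.dtype)
    flat[mask] = values       # scatter nonzeros back to their positions
    return flat.reshape(shape)

# Toy example: prune ~75% of an 8x8 FP32 weight tile to zero.
rng = np.random.default_rng(0)
w = rng.standard_normal((8, 8)).astype(np.float32)
w[rng.random(w.shape) < 0.75] = 0.0

mask, vals, shape = compress_bitmask(w)
assert np.array_equal(decompress_bitmask(mask, vals, shape), w)

# Footprint with the mask packed to 1 bit/weight vs. 32-bit dense storage:
print(f"dense: {w.size * 32} bits, compressed: {mask.size + vals.size * 32} bits")
```

The sketch also shows the limit of such simple schemes: with a one-bit mask per weight, the compression ratio can never exceed the weight bit-width even at extreme sparsity, which is one reason higher-ratio but sequential coders remain attractive and why the summary frames prior work as a ratio-versus-parallelism tradeoff.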
ISSN | 2152-3630
DOI | 10.1109/CICC63670.2025.10982861