NeuralMatrix: Compute the Entire Neural Networks with Linear Matrix Operations for Efficient Inference
Main Authors | |
---|---|
Format | Journal Article |
Language | English |
Published | 23.05.2023 |
DOI | 10.48550/arxiv.2305.14405 |
Summary: | The inherent diversity of computation types within deep neural network (DNN) models often requires a variety of specialized units in hardware processors, which limits computational efficiency and increases both inference latency and power consumption, especially when a single hardware processor must support and execute different neural networks. In this study, we introduce NeuralMatrix, which elastically transforms the computations of entire DNNs into linear matrix operations. This transformation allows various DNN models to be executed seamlessly with matrix operations alone and paves the way for running versatile DNN models on a single General Matrix Multiplication (GEMM) accelerator. Extensive experiments with both CNN- and transformer-based models demonstrate the potential of NeuralMatrix to accurately and efficiently execute a wide range of DNN models, achieving 2.17x to 38.72x higher computational efficiency (i.e., throughput per watt) compared to CPUs, GPUs, and SoC platforms. This level of efficiency is usually attainable only with an accelerator designed for a specific neural network. |
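
The summary describes the key move, folding a network's non-GEMM computations into linear matrix operations, only at a high level. Below is a minimal Python sketch of one way such a transformation can work, assuming piecewise-linear approximation of a non-linear operator (here GELU): the 256-segment granularity, the helper names `build_pwl_table` and `pwl_apply`, and the NumPy modelling are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, used here only as the reference non-linearity
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def build_pwl_table(fn, lo=-8.0, hi=8.0, segments=256):
    """Precompute slope/intercept pairs that approximate `fn` with
    piecewise-linear segments over [lo, hi] (granularity is an assumption)."""
    xs = np.linspace(lo, hi, segments + 1)
    ys = fn(xs)
    slopes = (ys[1:] - ys[:-1]) / (xs[1:] - xs[:-1])
    intercepts = ys[:-1] - slopes * xs[:-1]
    return xs, slopes, intercepts

def pwl_apply(x, xs, slopes, intercepts):
    """Evaluate the approximation with only multiply-accumulate work,
    the primitive a GEMM accelerator already provides."""
    idx = np.clip(np.searchsorted(xs, x, side="right") - 1, 0, len(slopes) - 1)
    return slopes[idx] * x + intercepts[idx]

# A toy layer: the linear part is already a GEMM, and the activation
# becomes per-element multiply-accumulate via the precomputed table.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)).astype(np.float32)
x = rng.standard_normal((8, 64)).astype(np.float32)

xs, slopes, intercepts = build_pwl_table(gelu)
h = x @ W                                      # plain GEMM
h_act = pwl_apply(h, xs, slopes, intercepts)   # non-linearity as MACs

print(np.max(np.abs(h_act - gelu(h))))         # approximation error stays small
```

On an actual GEMM accelerator, the per-segment slope and intercept lookup would be realized within the existing multiply-accumulate datapath, which is what lets one generic unit serve models that would otherwise need specialized non-linear hardware.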