NeuralMatrix: Compute the Entire Neural Networks with Linear Matrix Operations for Efficient Inference


Bibliographic Details
Main Authors: Sun, Ruiqi, Ye, Siwei, Zhao, Jie, He, Xin, Lin, Jianzhe, Li, Yiran, Zou, An
Format: Journal Article
Language: English
Published: 23.05.2023
Subjects
Online Access: Get full text
DOI: 10.48550/arxiv.2305.14405


More Information
Summary: The inherent diversity of computation types within deep neural network (DNN) models often requires a variety of specialized units in hardware processors, which limits computational efficiency, increasing both inference latency and power consumption, especially when the hardware processor needs to support and execute different neural networks. In this study, we introduce NeuralMatrix, which elastically transforms the computations of entire DNNs into linear matrix operations. This transformation allows seamless execution of various DNN models entirely with matrix operations and paves the way for running versatile DNN models on a single General Matrix Multiplication (GEMM) accelerator. Extensive experiments with both CNN and transformer-based models demonstrate the potential of NeuralMatrix to accurately and efficiently execute a wide range of DNN models, achieving 2.17-38.72 times the computation efficiency (i.e., throughput per power) of CPUs, GPUs, and SoC platforms. This level of efficiency is usually only attainable with an accelerator designed for a specific neural network.
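The abstract's core idea is that non-linear DNN operations can be recast as linear matrix computations so that a GEMM accelerator can run the whole network. One common way to achieve this, assumed here purely for illustration (the paper's exact mapping may differ), is piecewise-linear approximation: a non-linear activation such as GELU is replaced by precomputed per-segment slopes and intercepts, so evaluation reduces to the multiply-add arithmetic a GEMM unit already provides.

```python
# Hypothetical sketch of piecewise-linear activation approximation; the
# function names and segment count are illustrative, not from the paper.
import numpy as np

def gelu(x):
    # Reference GELU (tanh approximation).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def make_pwl_table(f, lo=-4.0, hi=4.0, segments=64):
    # Precompute slope k[i] and intercept b[i] per segment,
    # so that f(x) ~= k[i] * x + b[i] for x in segment i.
    edges = np.linspace(lo, hi, segments + 1)
    x0, x1 = edges[:-1], edges[1:]
    k = (f(x1) - f(x0)) / (x1 - x0)
    b = f(x0) - k * x0
    return edges, k, b

def pwl_eval(x, edges, k, b):
    # Segment lookup followed by a fused multiply-add:
    # the only arithmetic is linear, hence GEMM-friendly.
    idx = np.clip(np.searchsorted(edges, x) - 1, 0, len(k) - 1)
    return k[idx] * x + b[idx]

edges, k, b = make_pwl_table(gelu)
x = np.linspace(-3.5, 3.5, 1000)
err = np.max(np.abs(pwl_eval(x, edges, k, b) - gelu(x)))
print(f"max abs error over [-3.5, 3.5]: {err:.5f}")
```

With 64 segments over [-4, 4] the approximation error stays small while every evaluation is a table lookup plus one multiply-add, which is the kind of uniformity that lets heterogeneous DNN operations share one matrix-multiplication datapath.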