HIPU: A Hybrid Intelligent Processing Unit With Fine-Grained ISA for Real-Time Deep Neural Network Inference Applications

Bibliographic Details
Published in: IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 31, No. 12, pp. 1980–1993
Main Authors: Zhao, Wenzhe; Yang, Guoming; Xia, Tian; Chen, Fei; Zheng, Nanning; Ren, Pengju
Format: Journal Article
Language: English
Published: New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.12.2023

Summary: Neural network algorithms have shown superior performance over conventional algorithms, leading to the design and deployment of dedicated accelerators in practical scenarios. Coarse-grained accelerators achieve high performance but can support only a limited number of predesigned operators, which cannot cover the flexible operators emerging in modern neural network algorithms. Therefore, fine-grained accelerators, such as instruction set architecture (ISA)-based accelerators, have become a hot research topic due to their sufficient flexibility to cover unpredefined operators. The main challenges for fine-grained accelerators include the undesirably long delays of single-image inference when performing multibatch inference, as well as the difficulty of meeting real-time constraints when processing multiple tasks simultaneously. This article proposes a hybrid intelligent processing unit (HIPU) to address the aforementioned problems. Specifically, we design a novel conversion-free data format, expand the single-instruction multiple-data (SIMD) instruction set, and optimize the microarchitecture design to improve performance. We also arrange the inference schedule to guarantee scalability on multicores. The experimental results show that the proposed accelerator maintains high multiply-accumulate (MAC) utilization for all common operators and achieves high performance, with a 4–7× speedup over an NVIDIA RTX 2080 Ti GPU. Finally, the proposed accelerator is manufactured using TSMC 28-nm technology, achieving 1 GHz for each core, with a peak performance of 13 TOPS.
ISSN: 1063-8210, 1557-9999
DOI: 10.1109/TVLSI.2023.3327110