HIPU: A Hybrid Intelligent Processing Unit With Fine-Grained ISA for Real-Time Deep Neural Network Inference Applications
Published in | IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 31, No. 12, pp. 1980-1993
Main Authors | Zhao, Wenzhe; Yang, Guoming; Xia, Tian; Chen, Fei; Zheng, Nanning; Ren, Pengju
Format | Journal Article
Language | English
Published | New York: IEEE (The Institute of Electrical and Electronics Engineers, Inc.), 01.12.2023
ISSN | 1063-8210 (print); 1557-9999 (electronic)
DOI | 10.1109/TVLSI.2023.3327110
Abstract | Neural network algorithms have shown superior performance over conventional algorithms, leading to the design and deployment of dedicated accelerators in practical scenarios. Coarse-grained accelerators achieve high performance but can support only a limited number of predesigned operators, which cannot cover the flexible operators emerging in modern neural network algorithms. Therefore, fine-grained accelerators, such as instruction set architecture (ISA)-based accelerators, have become a hot research topic because they are flexible enough to cover operators that were not predefined. The main challenges for fine-grained accelerators are the long single-image latency incurred when performing multibatch inference and the difficulty of meeting real-time constraints when processing multiple tasks simultaneously. This article proposes a hybrid intelligent processing unit (HIPU) to address these problems. Specifically, we design a novel conversion-free data format, expand the single-instruction multiple-data (SIMD) instruction set, and optimize the microarchitecture to improve performance. We also arrange the inference schedule to guarantee scalability across multiple cores. The experimental results show that the proposed accelerator maintains high multiply-accumulate (MAC) utilization for all common operators and achieves high performance, with a 4-7x speedup over an NVIDIA RTX 2080 Ti GPU. Finally, the proposed accelerator is manufactured in TSMC 28-nm technology, with each core running at 1 GHz and a peak performance of 13 TOPS.
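The abstract describes the conversion-free data format and the expanded SIMD multiply-accumulate instructions only at a high level. As a rough, hypothetical illustration of those two ideas (not the actual HIPU design), the C sketch below keeps int8 activations and weights, accumulates lane-wise products into int32, and requantizes the accumulators straight back to int8 for the next operator, so no intermediate floating-point conversion is needed between layers. The 16-lane width, the shift-based requantization, and all names here are assumptions made for this example.

#include <stdint.h>
#include <stdio.h>

#define LANES 16  /* hypothetical SIMD width: 16 int8 lanes per instruction */

/* One SIMD-style MAC step: acc[i] += a[i] * w[i] for every lane. */
static void mac_lanes(int32_t acc[LANES], const int8_t a[LANES], const int8_t w[LANES]) {
    for (int i = 0; i < LANES; i++)
        acc[i] += (int32_t)a[i] * (int32_t)w[i];
}

/* Requantize an int32 accumulator back to int8 with a power-of-two scale,
 * so the result can feed the next operator in the same int8 format
 * without any intermediate floating-point conversion. */
static int8_t requant(int32_t acc, int shift) {
    int32_t r = acc / (1 << shift);   /* truncating divide keeps this well-defined */
    if (r > 127) r = 127;             /* saturate to the int8 range */
    if (r < -128) r = -128;
    return (int8_t)r;
}

int main(void) {
    int8_t a[LANES], w[LANES], out[LANES];
    int32_t acc[LANES] = {0};

    for (int i = 0; i < LANES; i++) { a[i] = (int8_t)(i - 8); w[i] = 3; }

    /* Issue a few MAC "instructions", then requantize once per output lane. */
    for (int step = 0; step < 4; step++)
        mac_lanes(acc, a, w);
    for (int i = 0; i < LANES; i++)
        out[i] = requant(acc[i], 4);

    for (int i = 0; i < LANES; i++)
        printf("%d ", out[i]);
    printf("\n");
    return 0;
}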
Author | Chen, Fei; Zheng, Nanning; Xia, Tian; Ren, Pengju; Yang, Guoming; Zhao, Wenzhe
Author Details | Wenzhe Zhao (ORCID 0000-0002-7001-2125); Guoming Yang; Tian Xia (ORCID 0000-0002-2520-3731); Fei Chen; Nanning Zheng; Pengju Ren (ORCID 0000-0003-1163-2014; pengjuren@xjtu.edu.cn). All authors are with the National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, the National Engineering Research Center of Visual Information and Applications, and the Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, Xi'an, Shaanxi, China.
CODEN | IEVSE9 |
ContentType | Journal Article |
Copyright | Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2023 |
DOI | 10.1109/TVLSI.2023.3327110 |
Discipline | Engineering |
EISSN | 1557-9999 |
EndPage | 1993 |
Genre | orig-research |
GrantInformation | National Natural Science Foundation of China (Grants 62302381 and 62088102); Fundamental Research Funds for the Central Universities (Grant xtr072022001); National Key Research and Development Program of China (Grant 2022YFB4500500); Key Research and Development Program of Shaanxi Province (Grant 2022ZDLGY01-08)
ISSN | 1063-8210 |
IsPeerReviewed | true |
IsScholarly | true |
Issue | 12 |
Language | English |
License | https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html https://doi.org/10.15223/policy-029 https://doi.org/10.15223/policy-037 |
ORCID | 0000-0002-7001-2125 0000-0003-1163-2014 0000-0002-2520-3731 |
PageCount | 14 |
PublicationCentury | 2000 |
PublicationDate | 2023-12-01 |
PublicationDateYYYYMMDD | 2023-12-01 |
PublicationDecade | 2020 |
PublicationPlace | New York |
PublicationTitle | IEEE transactions on very large scale integration (VLSI) systems |
PublicationTitleAbbrev | TVLSI |
PublicationYear | 2023 |
Publisher | IEEE The Institute of Electrical and Electronics Engineers, Inc. (IEEE) |
StartPage | 1980 |
SubjectTerms | Accelerators; Algorithms; Artificial neural networks; Convolution; Design optimization; Inference; Inference algorithms; Matrix converters; Network-on-chip (NoC); neural network (NN) inference accelerating; Neural networks; Operators; out-of-order (OoO) superscalar processor; Performance enhancement; Real time; Real-time systems; reduced instruction set architecture; Schedules; Task analysis
Title | HIPU: A Hybrid Intelligent Processing Unit With Fine-Grained ISA for Real-Time Deep Neural Network Inference Applications |
URI | https://ieeexplore.ieee.org/document/10313116 https://www.proquest.com/docview/2895011416 |
Volume | 31 |