VAQF: Fully Automatic Software-Hardware Co-Design Framework for Low-Bit Vision Transformer

Bibliographic Details
Main Authors: Sun, Mengshu; Ma, Haoyu; Kang, Guoliang; Jiang, Yifan; Chen, Tianlong; Ma, Xiaolong; Wang, Zhangyang; Wang, Yanzhi
Format: Journal Article
Language: English
Published: 17.01.2022
Subjects: Computer Science - Computer Vision and Pattern Recognition; Computer Science - Learning
Online Access: https://arxiv.org/abs/2201.06618 (DOI: 10.48550/arxiv.2201.06618)

Abstract: Transformer architectures with attention mechanisms have achieved great success in Natural Language Processing (NLP), and Vision Transformers (ViTs) have recently extended them to various vision tasks. While achieving high performance, ViTs suffer from large model size and high computation complexity, which hinders their deployment on edge devices. To achieve high throughput on hardware and preserve model accuracy simultaneously, we propose VAQF, a framework that builds inference accelerators on FPGA platforms for quantized ViTs with binary weights and low-precision activations. Given the model structure and the desired frame rate, VAQF automatically outputs the required quantization precision for activations as well as the optimized parameter settings of the accelerator that fulfill the hardware requirements. The implementations are developed with Vivado High-Level Synthesis (HLS) on the Xilinx ZCU102 FPGA board, and the evaluation results with the DeiT-base model indicate that a frame-rate requirement of 24 frames per second (FPS) is satisfied with 8-bit activation quantization, and a target of 30 FPS is met with 6-bit activation quantization. To the best of our knowledge, this is the first time quantization has been incorporated into ViT acceleration on FPGAs with the help of a fully automatic framework that guides the quantization strategy on the software side and the accelerator implementation on the hardware side for a given target frame rate. The compilation time cost is very small compared with quantization training, and the generated accelerators show the capability of achieving real-time execution for state-of-the-art ViT models on FPGAs.
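
The following minimal Python/NumPy sketch illustrates the two quantization ideas the abstract refers to, binary weights and low-bit activations. The mean-absolute-value weight scaling and the uniform unsigned activation quantizer are illustrative assumptions for this sketch, not necessarily the exact scheme implemented by VAQF.

import numpy as np

def binarize_weights(w):
    # Binarize weights to {-alpha, +alpha}; scaling by the per-tensor mean
    # absolute value is a common choice for binary networks (assumption,
    # not necessarily VAQF's exact scheme).
    alpha = np.mean(np.abs(w))
    return alpha * np.where(w >= 0, 1.0, -1.0)

def quantize_activations(x, bits):
    # Uniform unsigned quantization to 2**bits levels over [0, max(x)],
    # an illustrative stand-in for low-precision activation quantization.
    levels = 2 ** bits - 1
    x = np.clip(x, 0.0, None)
    scale = x.max() / levels if x.max() > 0 else 1.0
    return np.round(x / scale) * scale

# Example: 6-bit activations, the precision the abstract reports for the 30 FPS target.
x = np.random.rand(4, 8).astype(np.float32)
w = np.random.randn(8, 8).astype(np.float32)
y = quantize_activations(x, bits=6) @ binarize_weights(w)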