A survey of techniques for optimizing transformer inference

Bibliographic Details
Published in: Journal of Systems Architecture, Vol. 144, No. C, Art. no. 102990
Main Authors: Chitty-Venkata, Krishna Teja; Mittal, Sparsh; Emani, Murali; Vishwanath, Venkatram; Somani, Arun K.
Format: Journal Article
Language: English
Published: Netherlands: Elsevier B.V., 01.11.2023
Online Access: Get full text

Abstract Recent years have seen a phenomenal rise in the performance and applications of transformer neural networks. The family of transformer networks, including Bidirectional Encoder Representations from Transformer (BERT), Generative Pretrained Transformer (GPT) and Vision Transformer (ViT), have shown their effectiveness across Natural Language Processing (NLP) and Computer Vision (CV) domains. Transformer-based networks such as ChatGPT have impacted the lives of common men. However, the quest for high predictive performance has led to an exponential increase in transformers’ memory and compute footprint. Researchers have proposed techniques to optimize transformer inference at all levels of abstraction. This paper presents a comprehensive survey of techniques for optimizing the inference phase of transformer networks. We survey techniques such as knowledge distillation, pruning, quantization, neural architecture search and lightweight network design at the algorithmic level. We further review hardware-level optimization techniques and the design of novel hardware accelerators for transformers. We summarize the quantitative results on the number of parameters/FLOPs and the accuracy of several models/techniques to showcase the tradeoff exercised by them. We also outline future directions in this rapidly evolving field of research. We believe that this survey will educate both novice and seasoned researchers and also spark a plethora of research efforts in this field.
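Illustrative note (not part of the article record): the abstract names pruning and quantization among the algorithm-level optimizations surveyed. The short sketch below, which assumes a PyTorch environment and uses placeholder layer sizes and a placeholder 50% sparsity level, shows how those two techniques can be applied to a transformer-style feed-forward block; it is a minimal illustration, not code from the surveyed paper.

import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# A feed-forward block of the kind found inside each transformer encoder layer
# (placeholder sizes; real models use the hidden sizes of BERT, GPT, ViT, etc.).
ffn = nn.Sequential(
    nn.Linear(768, 3072),
    nn.GELU(),
    nn.Linear(3072, 768),
).eval()

# Magnitude pruning: zero out the 50% smallest-magnitude weights of each Linear layer.
for module in ffn:
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")  # bake the zeros into the weight tensor

# Post-training dynamic quantization: Linear weights are stored as int8 and
# activations are quantized on the fly at inference time.
ffn_int8 = torch.quantization.quantize_dynamic(ffn, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 16, 768)  # (batch, tokens, hidden size)
with torch.no_grad():
    y = ffn_int8(x)
print(y.shape)  # torch.Size([1, 16, 768])

The pruned-and-quantized block trades a small amount of accuracy for a smaller memory footprint and faster CPU inference, which is the parameters/FLOPs-versus-accuracy tradeoff the survey tabulates across models and techniques.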
ArticleNumber: 102990
Authors:
– Krishna Teja Chitty-Venkata (krishnat@iastate.edu), Iowa State University, Ames, IA, USA
– Sparsh Mittal (sparsh.mittal@ece.iitr.ac.in), Indian Institute of Technology Roorkee, Uttarakhand, India; ORCID 0000-0002-2908-993X
– Murali Emani (memani@anl.gov), Argonne National Laboratory, Lemont, IL, USA
– Venkatram Vishwanath (venkat@anl.gov), Argonne National Laboratory, Lemont, IL, USA
– Arun K. Somani (arun@iastate.edu), Iowa State University, Ames, IA, USA
BackLink: https://www.osti.gov/biblio/2004641 (View this record in OSTI.GOV)
ContentType: Journal Article
Copyright: 2023 Elsevier B.V.
DOI: 10.1016/j.sysarc.2023.102990
DatabaseName: CrossRef; OSTI.GOV
Discipline: Computer Science
EISSN: 1873-6165
ExternalDocumentID: 2004641 (OSTI); S1383762123001698 (Elsevier PII)
ISSN: 1383-7621
IsDoiOpenAccess: false
IsOpenAccess: true
IsPeerReviewed: true
IsScholarly: true
Issue: C
Keywords: FPGA; Self-attention; Quantization; CPU; GPT; Neural architecture search; Vision transformers; GPU; Hardware acceleration; ASIC; Knowledge distillation; Pruning; Transformers; BERT
Notes: USDOE; DE-AC02-06CH11357
ORCID: 0000-0002-2908-993X
OpenAccessLink: https://www.osti.gov/biblio/2004641
PublicationDate: November 2023
PublicationPlace: Netherlands
PublicationTitle: Journal of Systems Architecture
PublicationYear: 2023
Publisher: Elsevier B.V.
– year: 2022
  ident: b34
  article-title: Transcending scaling laws with 0.1% extra compute
– reference: Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, Swin transformer: Hierarchical vision transformer using shifted windows, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10012–10022.
– start-page: 3187
  year: 2020
  end-page: 3199
  ident: b122
  article-title: Efficient transformer-based large scale language representations using hardware-friendly block structured pruning
  publication-title: Findings of the Association for Computational Linguistics: EMNLP 2020
– reference: H. Benmeziane, H. Ouarnoughi, K.E. Maghraoui, S. Niar, Real-time style transfer with efficient vision transformers, in: Proceedings of the 5th International Workshop on Edge Systems, Analytics and Networking, 2022, pp. 31–36.
– year: 2015
  ident: b45
  article-title: Distilling the knowledge in a neural network
– reference: A. Zeng, X. Liu, Z. Du, Z. Wang, H. Lai, M. Ding, Z. Yang, Y. Xu, W. Zheng, X. Xia, et al., GLM-130B: An open bilingual pre-trained model, in: ICLR, 2023.
– volume: 104
  year: 2020
  ident: b269
  article-title: A survey on modeling and improving reliability of DNN algorithms and accelerators
  publication-title: J. Syst. Archit.
– year: 2018
  ident: b139
  article-title: PACT: Parameterized clipping activation for quantized neural networks
– reference: N. Kitaev, Ł. Kaiser, A. Levskaya, Reformer: The efficient transformer, in: International Conference on Learning Representations, 2020.
– year: 2017
  ident: b185
  article-title: Mobilenets: Efficient convolutional neural networks for mobile vision applications
– start-page: 5877
  year: 2019
  end-page: 5886
  ident: b198
  article-title: The evolved transformer
  publication-title: International Conference on Machine Learning
– year: 2022
  ident: b192
  article-title: Vision transformer with convolutions architecture search
– volume: 34
  start-page: 24261
  year: 2021
  end-page: 24272
  ident: b226
  article-title: MLP-Mixer: An all-MLP architecture for vision
  publication-title: Adv. Neural Inf. Process. Syst.
– year: 2022
  ident: b32
  article-title: PaLM: Scaling language modeling with Pathways
– reference: M. Chen, H. Peng, J. Fu, H. Ling, AutoFormer: Searching transformers for visual recognition, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 12270–12280.
– reference: H. Wang, Z. Wu, Z. Liu, H. Cai, L. Zhu, C. Gan, S. Han, HAT: Hardware-Aware Transformers for Efficient Natural Language Processing, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 7675–7688.
– year: 2022
  ident: b110
  article-title: Adaptive sparse ViT: Towards learnable adaptive token pruning by fully exploiting self-attention
– reference: Y. Lin, T. Zhang, P. Sun, Z. Li, S. Zhou, FQ-ViT: Post-Training Quantization for Fully Quantized Vision Transformer, in: International Joint Conference on Artificial Intelligence, IJCAI-22, 2022, pp. 1173–1179.
– year: 2022
  ident: b206
  article-title: LightHuBERT: Lightweight and configurable speech representation learning with once-for-all hidden-unit BERT
– year: 2019
  ident: b164
  article-title: A study of BFLOAT16 for deep learning training
– start-page: 5506
  year: 2021
  end-page: 5518
  ident: b150
  article-title: I-BERT: Integer-only BERT quantization
  publication-title: International Conference on Machine Learning
– reference: X. Chen, Q. Cao, Y. Zhong, J. Zhang, S. Gao, D. Tao, DearKD: data-efficient early knowledge distillation for vision transformers, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12052–12062.
– year: 2020
  ident: b55
  article-title: Efficient transformers: A survey
– reference: H. Peng, S. Huang, T. Geng, A. Li, W. Jiang, H. Liu, S. Wang, C. Ding, Accelerating transformer-based deep learning models on FPGAs using column balanced block pruning, in: International Symposium on Quality Electronic Design, ISQED, 2021, pp. 142–148.
– reference: O. Zafrir, G. Boudoukh, P. Izsak, M. Wasserblat, Q8BERT: Quantized 8Bit BERT, in: Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing-NeurIPS Edition, EMC2-NIPS, 2019, pp. 36–39.
– reference: B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, D. Kalenichenko, Quantization and training of neural networks for efficient integer-arithmetic-only inference, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2704–2713.
– reference: Y. Chen, G. Meng, Q. Zhang, S. Xiang, C. Huang, L. Mu, X. Wang, RENAS: Reinforced evolutionary neural architecture search, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 4787–4796.
– year: 2022
  ident: b24
  article-title: Crosslingual generalization through multitask finetuning
– start-page: 442
  year: 2023
  end-page: 455
  ident: b70
  article-title: Heatvit: Hardware-efficient adaptive token pruning for vision transformers
  publication-title: 2023 IEEE International Symposium on High-Performance Computer Architecture
– reference: H. Li, J. Choi, J. Ahn, A Slice and Dice Approach to Accelerate Compound Sparse Attention on GPU, in: IEEE International Symposium on Workload Characterization, IISWC, 2022.
– reference: S. Yu, T. Chen, J. Shen, H. Yuan, J. Tan, S. Yang, J. Liu, Z. Wang, Unified visual transformer compression, in: ICLR, 2022.
– year: 2017
  ident: b123
  article-title: Block-sparse recurrent neural networks
– reference: H. Peng, S. Huang, S. Chen, B. Li, T. Geng, A. Li, W. Jiang, W. Wen, J. Bi, H. Liu, et al., A length adaptive algorithm-hardware co-design of transformer on FPGA through sparse attention and dynamic pipelining, in: Design Automation Conference, 2022, pp. 1135–1140.
– reference: B. Li, S. Lu, K. Xie, Z. Wang, Accelerating NLP Tasks on FPGA with Compressed BERT and a Hardware-Oriented Early Exit Method, in: IEEE Computer Society Annual Symposium on VLSI, ISVLSI, 2022, pp. 410–413.
– year: 2021
  ident: b193
  article-title: Memory-efficient differentiable transformer architecture search
– year: 2020
  ident: b227
  article-title: Real-time execution of large-scale language models on mobile
– reference: D. Chen, Y. Li, M. Qiu, Z. Wang, B. Li, B. Ding, H. Deng, J. Huang, W. Lin, J. Zhou, AdaBERT: Task-adaptive BERT compression with differentiable neural architecture search, in: IJCAI, 2020.
– year: 2020
  ident: b36
  article-title: GShard: Scaling giant models with conditional computation and automatic sharding
– volume: 41
  start-page: 29
  year: 2021
  end-page: 35
  ident: b166
  article-title: Nvidia a100 tensor core GPU: Performance and innovation
  publication-title: IEEE Micro
– year: 2022
  ident: b244
  article-title: Accelerating attention mechanism on FPGAs based on efficient reconfigurable systolic array
  publication-title: ACM Trans. Embedded Comput. Syst. (TECS)
– reference: S. Kim, S. Shen, D. Thorsley, A. Gholami, W. Kwon, J. Hassoun, K. Keutzer, Learned token pruning for transformers, in: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022, pp. 784–794.
– reference: K. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlos, P. Hawkins, J. Davis, A. Mohiuddin, L. Kaiser, et al., Rethinking attention with Performers, in: International Conference on Learning Representations, 2020.
– start-page: 328
  year: 2020
  end-page: 341
  ident: b145
  article-title: Â 3: Accelerating attention mechanisms in neural networks with approximation
  publication-title: 2020 IEEE International Symposium on High Performance Computer Architecture
– volume: 34
  start-page: 12077
  year: 2021
  end-page: 12090
  ident: b254
  article-title: SegFormer: Simple and efficient design for semantic segmentation with transformers
  publication-title: Adv. Neural Inf. Process. Syst.
– year: 2022
  ident: b127
  article-title: Deep compression of pre-trained transformer models
  publication-title: Advances in Neural Information Processing Systems
– year: 2022
  ident: b106
  article-title: DTATrans: Leveraging dynamic token-based quantization with accuracy compensation mechanism for efficient transformer architecture
  publication-title: IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst.
– volume: 35
  start-page: 12934
  year: 2022
  end-page: 12949
  ident: b179
  article-title: EfficientFormer: Vision transformers at mobilenet speed
  publication-title: Adv. Neural Inf. Process. Syst.
– reference: M. Artetxe, S. Bhosale, N. Goyal, T. Mihaylov, M. Ott, S. Shleifer, X.V. Lin, J. Du, S. Iyer, R. Pasunuru, et al., Efficient Large Scale Language Modeling with Mixtures of Experts, in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 11699–11732.
– reference: E. Kurtic, D. Campos, T. Nguyen, E. Frantar, M. Kurtz, B. Fineran, M. Goin, D. Alistarh, The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2022.
– volume: 32
  start-page: 25
  year: 2020
  end-page: 35
  ident: b82
  article-title: Learning student networks via feature embedding
  publication-title: IEEE Trans. Neural Netw. Learn. Syst.
– volume: 29
  start-page: 3451
  year: 2021
  end-page: 3460
  ident: b231
  article-title: HuBERT: Self-supervised speech representation learning by masked prediction of hidden units
  publication-title: IEEE/ACM Trans. Audio Speech Lang. Process.
– start-page: 5547
  year: 2022
  end-page: 5569
  ident: b39
  article-title: GLaM: Efficient scaling of language models with mixture-of-experts
  publication-title: International Conference on Machine Learning
– start-page: 13937
  year: 2021
  end-page: 13949
  ident: b112
  article-title: DynamicViT: Efficient vision transformers with dynamic token sparsification
  publication-title: Advances in Neural Information Processing Systems, Vol. 34
– year: 2019
  ident: b48
  article-title: Are sixteen heads really better than one?
  publication-title: Advances in Neural Information Processing Systems, Vol. 32
– volume: 30
  start-page: 1573
  year: 2022
  end-page: 1586
  ident: b87
  article-title: An algorithm–hardware co-optimized framework for accelerating N:M sparse transformers
  publication-title: IEEE Trans. Very Large Scale Integr. (VLSI) Syst.
– volume: 48
  start-page: 733
  year: 2022
  end-page: 763
  ident: b58
  article-title: Position information in transformers: An overview
  publication-title: Comput. Linguist.
– year: 2021
  ident: b28
  article-title: Ernie 3.0: Large-scale knowledge enhanced pre-training for language understanding and generation
– year: 2021
  ident: b78
  article-title: Efficient vision transformers via fine-grained manifold distillation
– reference: X. Shi, P. Zhou, W. Chen, L. Xie, Efficient gradient-based neural architecture search for end-to-end ASR, in: Companion Publication of the 2021 International Conference on Multimodal Interaction, 2021, pp. 91–96.
– reference: R. Luo, X. Tan, R. Wang, T. Qin, J. Li, S. Zhao, E. Chen, T.-Y. Liu, LightSpeech: Lightweight and fast text to speech with neural architecture search, in: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, 2021, pp. 5699–5703.
– volume: 55
  start-page: 1
  year: 2023
  end-page: 40
  ident: b56
  article-title: A practical survey on faster and lighter transformers
  publication-title: ACM Comput. Surv.
– start-page: 47
  year: 2022
  end-page: 61
  ident: b212
  article-title: NASformer: Neural architecture search for vision transformer
  publication-title: Asian Conference on Pattern Recognition
– reference: Z. Sun, H. Yu, X. Song, R. Liu, Y. Yang, D. Zhou, MobileBERT: A Compact Task-Agnostic BERT for Resource-Limited Devices, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 2158–2170.
– reference: Z. Song, B. Fu, F. Wu, Z. Jiang, L. Jiang, N. Jing, X. Liang, DRQ: Dynamic region-based quantization for deep neural network acceleration, in: International Symposium on Computer Architecture, ISCA, 2020, pp. 1010–1021.
– year: 2022
  ident: b197
  article-title: Efficient sparsely activated transformers
– reference: E. Iofinova, A. Peste, M. Kurtz, D. Alistarh, How well do sparse ImageNet models transfer?, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12266–12276.
– year: 2021
  ident: b29
  article-title: Scaling language models: Methods, analysis & insights from training Gopher
– year: 2020
  ident: b263
  article-title: A generalization of transformer networks to graphs
– reference: A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, in: International Conference on Learning Representations, 2020.
– reference: H. Qin, Y. Ding, M. Zhang, Q. Yan, A. Liu, Q. Dang, Z. Liu, X. Liu, BiBERT: Accurate fully binarized BERT, in: ICLR, 2022.
– year: 2016
  ident: b60
  article-title: Gaussian error linear units (gelus)
– volume: 5
  start-page: 1
  year: 2021
  end-page: 22
  ident: b69
  article-title: TPrune: Efficient transformer pruning for mobile devices
  publication-title: ACM Trans. Cyber-Phys. Syst.
– volume: 35
  start-page: 27730
  year: 2022
  end-page: 27744
  ident: b21
  article-title: Training language models to follow instructions with human feedback
  publication-title: Adv. Neural Inf. Process. Syst.
– year: 2022
  ident: b5
  article-title: GenSLMs: Genome-Scale Language Models Reveal SARS-CoV-2 Evolutionary Dynamics
– start-page: 274
  year: 2022
  end-page: 288
  ident: b105
  article-title: An attention-based token pruning method for vision transformers
  publication-title: International Joint Conference on Rough Sets
– year: 2022
  ident: b96
  article-title: Sparse* BERT: Sparse models are robust
– start-page: 7588
  year: 2021
  end-page: 7598
  ident: b224
  article-title: Neural architecture search without training
  publication-title: International Conference on Machine Learning
– reference: M. Javaheripi, S. Shah, S. Mukherjee, T.L. Religa, C.C. Mendes, G.H. de Rosa, S. Bubeck, F. Koushanfar, D. Dey, LiteTransformerSearch: Training-free On-device Search for Efficient Autoregressive Language Models, in: AutoML Workshop, 2022.
– volume: 66
  start-page: 1
  year: 2023
  end-page: 2
  ident: b100
  article-title: A unified pruning framework for vision transformers
  publication-title: Sci. China Inf. Sci.
– reference: A. Fan, E. Grave, A. Joulin, Reducing transformer depth on demand with structured dropout, in: International Conference on Learning Representations, 2019.
– year: 2021
  ident: b230
  article-title: You only compress once: Towards effective and elastic BERT compression via exploit-explore stochastic nature gradient
– reference: L. Lu, Y. Jin, H. Bi, Z. Luo, P. Li, T. Wang, Y. Liang, Sanger: A co-design framework for enabling sparse attention using reconfigurable architecture, in: MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, 2021, pp. 977–991.
– volume: 34
  start-page: 1818
  year: 2021
  end-page: 1830
  ident: b88
  article-title: NxMTransformer: Semi-structured sparsification for natural language understanding via ADMM
  publication-title: Adv. Neural Inf. Process. Syst.
– reference: B. Chen, P. Li, C. Li, B. Li, L. Bai, C. Lin, M. Sun, J. Yan, W. Ouyang, GLiT: Neural architecture search for global and local image transformer, in: IEEE/CVF International Conference on Computer Vision, 2021, pp. 12–21.
– year: 2021
  ident: b86
  article-title: Accelerating sparse deep neural networks
– volume: 10
  start-page: 108374
  year: 2022
  end-page: 108412
  ident: b188
  article-title: Neural architecture search for transformers: A survey
  publication-title: IEEE Access
– reference: A.H. Zadeh, M. Mahmoud, A. Abdelhadi, A. Moshovos, Mokey: Enabling narrow fixed-point inference for out-of-the-box floating-point transformer models, in: Proceedings of the 49th Annual International Symposium on Computer Architecture, 2022, pp. 888–901.
– start-page: 25566
  year: 2022
  end-page: 25580
  ident: b195
  article-title: ShiftAddNAS: Hardware-inspired search for more accurate and efficient neural networks
  publication-title: International Conference on Machine Learning
– reference: G. Shen, J. Zhao, Q. Chen, J. Leng, C. Li, M. Guo, SALO: An efficient spatial accelerator enabling hybrid sparse attention mechanisms for long sequences, in: Design Automation Conference, 2022, pp. 571–576.
– year: 2022
  ident: b208
  article-title: Revisiting architecture-aware knowledge distillation: Smaller models and faster search
– reference: Y. Chen, X. Dai, D. Chen, M. Liu, X. Dong, L. Yuan, Z. Liu, Mobile-Former: Bridging mobilenet and transformer, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5270–5279.
– reference: A. Adhikari, A. Ram, R. Tang, W.L. Hamilton, J. Lin, Exploring the limits of simple learners in knowledge distillation for document classification with DocBERT, in: Proceedings of the 5th Workshop on Representation Learning for NLP, 2020, pp. 72–77.
– reference: M. Nagel, M.v. Baalen, T. Blankevoort, M. Welling, Data-free quantization through weight equalization and bias correction, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1325–1334.
– year: 2022
  ident: b11
  article-title: Improving alignment of dialogue agents via targeted human judgements
– reference: R. Rizk, D. Rizk, F. Rizk, A. Kumar, M. Bayoumi, A Resource-Saving Energy-Efficient Reconfigurable Hardware Accelerator for BERT-based Deep Neural Network Language Models using FFT Multiplication, in: IEEE International Symposium on Circuits and Systems, ISCAS, 2022, pp. 1675–1679.
– reference: A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan, et al., Searching for mobilenetv3, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1314–1324.
– start-page: 169
  year: 2021
  end-page: 182
  ident: b214
  article-title: Autotrans: Automating transformer design via reinforced architecture search
  publication-title: CCF International Conference on Natural Language Processing and Chinese Computing
– reference: B. Zoph, Q.V. Le, Neural architecture search with reinforcement learning, in: International Conference on Learning Representations, 2016.
– reference: F. Yu, K. Huang, M. Wang, Y. Cheng, W. Chu, L. Cui, Width & Depth Pruning for Vision Transformers, in: AAAI Conference on Artificial Intelligence, Vol. 2022, AAAI, 2022.
– volume: 128
  year: 2022
  ident: b238
  article-title: DiVIT: Algorithm and architecture co-design of differential attention in vision transformer
  publication-title: J. Syst. Archit.
– year: 2017
  ident: b2
  article-title: Attention is all you need
  publication-title: Advances in Neural Information Processing Systems, Vol. 30
– year: 2013
  ident: b137
  article-title: Estimating or propagating gradients through stochastic neurons for conditional computation
– year: 2022
  ident: b1
  article-title: Artificial intelligence (AI) market size, growth, Report 2022–2030
– start-page: 33
  year: 2022
  end-page: 49
  ident: b189
  article-title: UniNet: Unified architecture search with convolution, transformer, and MLP
  publication-title: European Conference on Computer Vision
– reference: Q. Chen, C. Sun, Z. Lu, C. Gao, Enabling Energy-Efficient Inference for Self-Attention Mechanisms in Neural Networks, in: IEEE 4th International Conference on Artificial Intelligence Circuits and Systems, AICAS, 2022, pp. 25–28.
– reference: S. Mehta, M. Rastegari, MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer, in: International Conference on Learning Representations, 2021.
– reference: S. Han, H. Mao, W.J. Dally, Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding, in: ICLR, 2016.
– year: 2022
  ident: b151
  article-title: I-ViT: integer-only quantization for efficient vision transformer inference
– volume: 34
  year: 2021
  ident: b202
  article-title: Searching the search space of vision transformer
  publication-title: Adv. Neural Inf. Process. Syst.
– start-page: 191
  year: 2022
  end-page: 207
  ident: b125
  article-title: PTQ4ViT: Post-training quantization for vision transformers with twin uniform quantization
  publication-title: European Conference on Computer Vision
– start-page: 1
  year: 2022
  end-page: 18
  ident: b253
  article-title: PipeBERT: High-throughput BERT inference for ARM big. LITTLE multi-core processors
  publication-title: J. Signal Process. Syst.
– reference: O. Lieber, O. Sharir, B. Lenz, Y. Shoham, Jurassic-1: Technical Details and Evaluation, Vol. 1, White Paper. AI21 Labs, 2021.
– year: 2023
  ident: b154
  article-title: SwiftTron: An efficient hardware accelerator for quantized transformers
– year: 2023
  ident: b37
  article-title: PanGu-
– reference: A. Rock, A. Untether, O. Khalil, O. Shai, P. Grouchy, INT8 Transformers for Inference Acceleration, in: 36th Conference on Neural Information Processing Systems, NeurIPS, 2022.
– year: 2023
  ident: b57
  article-title: Full stack optimization of transformer inference: A survey
– reference: Z. Liu, B. Wu, W. Luo, X. Yang, W. Liu, K.-T. Cheng, Bi-Real Net: Enhancing the Performance of 1-bit CNNs With Improved Representational Capability and Advanced Training Algorithm, in: European Conference on Computer Vision, ECCV, 2018, pp. 722–737.
– reference: E. Voita, D. Talbot, F. Moiseev, R. Sennrich, I. Titov, Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.
– volume: 33
  start-page: 5776
  year: 2020
  end-page: 5788
  ident: b73
  article-title: MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers
  publication-title: Adv. Neural Inf. Process. Syst.
– start-page: 26809
  year: 2022
  end-page: 26823
  ident: b97
  article-title: Platon: Pruning large transformer models with upper confidence bound of weight importance
  publication-title: International Conference on Machine Learning
– year: 2022
  ident: b16
  article-title: Lamda: Language models for dialog applications
– year: 2022
  ident: b31
  article-title: Large language models encode clinical knowledge