A survey of techniques for optimizing transformer inference

Bibliographic Details
Published in: Journal of Systems Architecture, Vol. 144, No. C, Art. no. 102990
Main Authors: Chitty-Venkata, Krishna Teja; Mittal, Sparsh; Emani, Murali; Vishwanath, Venkatram; Somani, Arun K.
Format: Journal Article
Language: English
Published: Netherlands: Elsevier B.V., 01.11.2023
Online Access: Get full text

Abstract Recent years have seen a phenomenal rise in the performance and applications of transformer neural networks. The family of transformer networks, including Bidirectional Encoder Representations from Transformer (BERT), Generative Pretrained Transformer (GPT) and Vision Transformer (ViT), have shown their effectiveness across Natural Language Processing (NLP) and Computer Vision (CV) domains. Transformer-based networks such as ChatGPT have impacted the lives of common men. However, the quest for high predictive performance has led to an exponential increase in transformers’ memory and compute footprint. Researchers have proposed techniques to optimize transformer inference at all levels of abstraction. This paper presents a comprehensive survey of techniques for optimizing the inference phase of transformer networks. We survey techniques such as knowledge distillation, pruning, quantization, neural architecture search and lightweight network design at the algorithmic level. We further review hardware-level optimization techniques and the design of novel hardware accelerators for transformers. We summarize the quantitative results on the number of parameters/FLOPs and the accuracy of several models/techniques to showcase the tradeoff exercised by them. We also outline future directions in this rapidly evolving field of research. We believe that this survey will educate both novice and seasoned researchers and also spark a plethora of research efforts in this field.
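Illustrative note (not part of the article record): the abstract names pruning and quantization among the algorithm-level optimizations surveyed. The short sketch below, which assumes a PyTorch environment and uses placeholder layer sizes and a placeholder 50% sparsity level, shows how those two techniques can be applied to a transformer-style feed-forward block; it is a minimal illustration, not code from the surveyed paper.

import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# A feed-forward block of the kind found inside each transformer encoder layer
# (placeholder sizes; real models use the hidden sizes of BERT, GPT, ViT, etc.).
ffn = nn.Sequential(
    nn.Linear(768, 3072),
    nn.GELU(),
    nn.Linear(3072, 768),
).eval()

# Magnitude pruning: zero out the 50% smallest-magnitude weights of each Linear layer.
for module in ffn:
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")  # bake the zeros into the weight tensor

# Post-training dynamic quantization: Linear weights are stored as int8 and
# activations are quantized on the fly at inference time.
ffn_int8 = torch.quantization.quantize_dynamic(ffn, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 16, 768)  # (batch, tokens, hidden size)
with torch.no_grad():
    y = ffn_int8(x)
print(y.shape)  # torch.Size([1, 16, 768])

The pruned-and-quantized block trades a small amount of accuracy for a smaller memory footprint and faster CPU inference, which is the parameters/FLOPs-versus-accuracy tradeoff the survey tabulates across models and techniques.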
ArticleNumber: 102990
Authors:
– Krishna Teja Chitty-Venkata (krishnat@iastate.edu), Iowa State University, Ames, IA, USA
– Sparsh Mittal (sparsh.mittal@ece.iitr.ac.in), Indian Institute of Technology Roorkee, Uttarakhand, India; ORCID 0000-0002-2908-993X
– Murali Emani (memani@anl.gov), Argonne National Laboratory, Lemont, IL, USA
– Venkatram Vishwanath (venkat@anl.gov), Argonne National Laboratory, Lemont, IL, USA
– Arun K. Somani (arun@iastate.edu), Iowa State University, Ames, IA, USA
BackLink: https://www.osti.gov/biblio/2004641 (View this record in OSTI.GOV)
ContentType: Journal Article
Copyright: 2023 Elsevier B.V.
DOI: 10.1016/j.sysarc.2023.102990
DatabaseName: CrossRef; OSTI.GOV
Discipline: Computer Science
EISSN: 1873-6165
ExternalDocumentID: 2004641 (OSTI); S1383762123001698 (Elsevier PII)
ISSN: 1383-7621
IsDoiOpenAccess: false
IsOpenAccess: true
IsPeerReviewed: true
IsScholarly: true
Issue: C
Keywords: FPGA; Self-attention; Quantization; CPU; GPT; Neural architecture search; Vision transformers; GPU; Hardware acceleration; ASIC; Knowledge distillation; Pruning; Transformers; BERT
Notes: USDOE; DE-AC02-06CH11357
ORCID: 0000-0002-2908-993X
OpenAccessLink: https://www.osti.gov/biblio/2004641
PublicationDate: November 2023
PublicationPlace: Netherlands
PublicationTitle: Journal of Systems Architecture
PublicationYear: 2023
Publisher: Elsevier B.V.
– year: 2022
  ident: b34
  article-title: Transcending scaling laws with 0.1% extra compute
– reference: Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, Swin transformer: Hierarchical vision transformer using shifted windows, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10012–10022.
– start-page: 3187
  year: 2020
  end-page: 3199
  ident: b122
  article-title: Efficient transformer-based large scale language representations using hardware-friendly block structured pruning
  publication-title: Findings of the Association for Computational Linguistics: EMNLP 2020
– reference: H. Benmeziane, H. Ouarnoughi, K.E. Maghraoui, S. Niar, Real-time style transfer with efficient vision transformers, in: Proceedings of the 5th International Workshop on Edge Systems, Analytics and Networking, 2022, pp. 31–36.
– year: 2015
  ident: b45
  article-title: Distilling the knowledge in a neural network
– reference: A. Zeng, X. Liu, Z. Du, Z. Wang, H. Lai, M. Ding, Z. Yang, Y. Xu, W. Zheng, X. Xia, et al., GLM-130B: An open bilingual pre-trained model, in: ICLR, 2023.
– volume: 104
  year: 2020
  ident: b269
  article-title: A survey on modeling and improving reliability of DNN algorithms and accelerators
  publication-title: J. Syst. Archit.
– year: 2018
  ident: b139
  article-title: PACT: Parameterized clipping activation for quantized neural networks
– reference: N. Kitaev, Ł. Kaiser, A. Levskaya, Reformer: The efficient transformer, in: International Conference on Learning Representations, 2020.
– year: 2017
  ident: b185
  article-title: Mobilenets: Efficient convolutional neural networks for mobile vision applications
– start-page: 5877
  year: 2019
  end-page: 5886
  ident: b198
  article-title: The evolved transformer
  publication-title: International Conference on Machine Learning
– year: 2022
  ident: b192
  article-title: Vision transformer with convolutions architecture search
– volume: 34
  start-page: 24261
  year: 2021
  end-page: 24272
  ident: b226
  article-title: MLP-Mixer: An all-MLP architecture for vision
  publication-title: Adv. Neural Inf. Process. Syst.
– year: 2022
  ident: b32
  article-title: PaLM: Scaling language modeling with Pathways
– reference: M. Chen, H. Peng, J. Fu, H. Ling, AutoFormer: Searching transformers for visual recognition, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 12270–12280.
– reference: H. Wang, Z. Wu, Z. Liu, H. Cai, L. Zhu, C. Gan, S. Han, HAT: Hardware-Aware Transformers for Efficient Natural Language Processing, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 7675–7688.
– year: 2022
  ident: b110
  article-title: Adaptive sparse ViT: Towards learnable adaptive token pruning by fully exploiting self-attention
– reference: Y. Lin, T. Zhang, P. Sun, Z. Li, S. Zhou, FQ-ViT: Post-Training Quantization for Fully Quantized Vision Transformer, in: International Joint Conference on Artificial Intelligence, IJCAI-22, 2022, pp. 1173–1179.
– year: 2022
  ident: b206
  article-title: LightHuBERT: Lightweight and configurable speech representation learning with once-for-all hidden-unit BERT
– year: 2019
  ident: b164
  article-title: A study of BFLOAT16 for deep learning training
– start-page: 5506
  year: 2021
  end-page: 5518
  ident: b150
  article-title: I-BERT: Integer-only BERT quantization
  publication-title: International Conference on Machine Learning
– reference: X. Chen, Q. Cao, Y. Zhong, J. Zhang, S. Gao, D. Tao, DearKD: data-efficient early knowledge distillation for vision transformers, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12052–12062.
– year: 2020
  ident: b55
  article-title: Efficient transformers: A survey
– reference: H. Peng, S. Huang, T. Geng, A. Li, W. Jiang, H. Liu, S. Wang, C. Ding, Accelerating transformer-based deep learning models on FPGAs using column balanced block pruning, in: International Symposium on Quality Electronic Design, ISQED, 2021, pp. 142–148.
– reference: O. Zafrir, G. Boudoukh, P. Izsak, M. Wasserblat, Q8BERT: Quantized 8Bit BERT, in: Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing-NeurIPS Edition, EMC2-NIPS, 2019, pp. 36–39.
– reference: B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, D. Kalenichenko, Quantization and training of neural networks for efficient integer-arithmetic-only inference, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2704–2713.
– reference: Y. Chen, G. Meng, Q. Zhang, S. Xiang, C. Huang, L. Mu, X. Wang, RENAS: Reinforced evolutionary neural architecture search, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 4787–4796.
– year: 2022
  ident: b24
  article-title: Crosslingual generalization through multitask finetuning
– start-page: 442
  year: 2023
  end-page: 455
  ident: b70
  article-title: Heatvit: Hardware-efficient adaptive token pruning for vision transformers
  publication-title: 2023 IEEE International Symposium on High-Performance Computer Architecture
– reference: H. Li, J. Choi, J. Ahn, A Slice and Dice Approach to Accelerate Compound Sparse Attention on GPU, in: IEEE International Symposium on Workload Characterization, IISWC, 2022.
– reference: S. Yu, T. Chen, J. Shen, H. Yuan, J. Tan, S. Yang, J. Liu, Z. Wang, Unified visual transformer compression, in: ICLR, 2022.
– year: 2017
  ident: b123
  article-title: Block-sparse recurrent neural networks
– reference: H. Peng, S. Huang, S. Chen, B. Li, T. Geng, A. Li, W. Jiang, W. Wen, J. Bi, H. Liu, et al., A length adaptive algorithm-hardware co-design of transformer on FPGA through sparse attention and dynamic pipelining, in: Design Automation Conference, 2022, pp. 1135–1140.
– reference: B. Li, S. Lu, K. Xie, Z. Wang, Accelerating NLP Tasks on FPGA with Compressed BERT and a Hardware-Oriented Early Exit Method, in: IEEE Computer Society Annual Symposium on VLSI, ISVLSI, 2022, pp. 410–413.
– year: 2021
  ident: b193
  article-title: Memory-efficient differentiable transformer architecture search
– year: 2020
  ident: b227
  article-title: Real-time execution of large-scale language models on mobile
– reference: D. Chen, Y. Li, M. Qiu, Z. Wang, B. Li, B. Ding, H. Deng, J. Huang, W. Lin, J. Zhou, AdaBERT: Task-adaptive BERT compression with differentiable neural architecture search, in: IJCAI, 2020.
– year: 2020
  ident: b36
  article-title: GShard: Scaling giant models with conditional computation and automatic sharding
– volume: 41
  start-page: 29
  year: 2021
  end-page: 35
  ident: b166
  article-title: Nvidia a100 tensor core GPU: Performance and innovation
  publication-title: IEEE Micro
– year: 2022
  ident: b244
  article-title: Accelerating attention mechanism on FPGAs based on efficient reconfigurable systolic array
  publication-title: ACM Trans. Embedded Comput. Syst. (TECS)
– reference: S. Kim, S. Shen, D. Thorsley, A. Gholami, W. Kwon, J. Hassoun, K. Keutzer, Learned token pruning for transformers, in: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022, pp. 784–794.
– reference: K. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlos, P. Hawkins, J. Davis, A. Mohiuddin, L. Kaiser, et al., Rethinking attention with Performers, in: International Conference on Learning Representations, 2020.
– start-page: 328
  year: 2020
  end-page: 341
  ident: b145
  article-title: Â 3: Accelerating attention mechanisms in neural networks with approximation
  publication-title: 2020 IEEE International Symposium on High Performance Computer Architecture
– volume: 34
  start-page: 12077
  year: 2021
  end-page: 12090
  ident: b254
  article-title: SegFormer: Simple and efficient design for semantic segmentation with transformers
  publication-title: Adv. Neural Inf. Process. Syst.
– year: 2022
  ident: b127
  article-title: Deep compression of pre-trained transformer models
  publication-title: Advances in Neural Information Processing Systems
– year: 2022
  ident: b106
  article-title: DTATrans: Leveraging dynamic token-based quantization with accuracy compensation mechanism for efficient transformer architecture
  publication-title: IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst.
– volume: 35
  start-page: 12934
  year: 2022
  end-page: 12949
  ident: b179
  article-title: EfficientFormer: Vision transformers at mobilenet speed
  publication-title: Adv. Neural Inf. Process. Syst.
– reference: M. Artetxe, S. Bhosale, N. Goyal, T. Mihaylov, M. Ott, S. Shleifer, X.V. Lin, J. Du, S. Iyer, R. Pasunuru, et al., Efficient Large Scale Language Modeling with Mixtures of Experts, in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 11699–11732.
– reference: E. Kurtic, D. Campos, T. Nguyen, E. Frantar, M. Kurtz, B. Fineran, M. Goin, D. Alistarh, The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2022.
– volume: 32
  start-page: 25
  year: 2020
  end-page: 35
  ident: b82
  article-title: Learning student networks via feature embedding
  publication-title: IEEE Trans. Neural Netw. Learn. Syst.
– volume: 29
  start-page: 3451
  year: 2021
  end-page: 3460
  ident: b231
  article-title: HuBERT: Self-supervised speech representation learning by masked prediction of hidden units
  publication-title: IEEE/ACM Trans. Audio Speech Lang. Process.
– start-page: 5547
  year: 2022
  end-page: 5569
  ident: b39
  article-title: GLaM: Efficient scaling of language models with mixture-of-experts
  publication-title: International Conference on Machine Learning
– start-page: 13937
  year: 2021
  end-page: 13949
  ident: b112
  article-title: DynamicViT: Efficient vision transformers with dynamic token sparsification
  publication-title: Advances in Neural Information Processing Systems, Vol. 34
– year: 2019
  ident: b48
  article-title: Are sixteen heads really better than one?
  publication-title: Advances in Neural Information Processing Systems, Vol. 32
– volume: 30
  start-page: 1573
  year: 2022
  end-page: 1586
  ident: b87
  article-title: An algorithm–hardware co-optimized framework for accelerating N:M sparse transformers
  publication-title: IEEE Trans. Very Large Scale Integr. (VLSI) Syst.
– volume: 48
  start-page: 733
  year: 2022
  end-page: 763
  ident: b58
  article-title: Position information in transformers: An overview
  publication-title: Comput. Linguist.
– year: 2021
  ident: b28
  article-title: Ernie 3.0: Large-scale knowledge enhanced pre-training for language understanding and generation
– year: 2021
  ident: b78
  article-title: Efficient vision transformers via fine-grained manifold distillation
– reference: X. Shi, P. Zhou, W. Chen, L. Xie, Efficient gradient-based neural architecture search for end-to-end ASR, in: Companion Publication of the 2021 International Conference on Multimodal Interaction, 2021, pp. 91–96.
– reference: R. Luo, X. Tan, R. Wang, T. Qin, J. Li, S. Zhao, E. Chen, T.-Y. Liu, LightSpeech: Lightweight and fast text to speech with neural architecture search, in: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, 2021, pp. 5699–5703.
– volume: 55
  start-page: 1
  year: 2023
  end-page: 40
  ident: b56
  article-title: A practical survey on faster and lighter transformers
  publication-title: ACM Comput. Surv.
– start-page: 47
  year: 2022
  end-page: 61
  ident: b212
  article-title: NASformer: Neural architecture search for vision transformer
  publication-title: Asian Conference on Pattern Recognition
– reference: Z. Sun, H. Yu, X. Song, R. Liu, Y. Yang, D. Zhou, MobileBERT: A Compact Task-Agnostic BERT for Resource-Limited Devices, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 2158–2170.
– reference: Z. Song, B. Fu, F. Wu, Z. Jiang, L. Jiang, N. Jing, X. Liang, DRQ: Dynamic region-based quantization for deep neural network acceleration, in: International Symposium on Computer Architecture, ISCA, 2020, pp. 1010–1021.
– year: 2022
  ident: b197
  article-title: Efficient sparsely activated transformers
– reference: E. Iofinova, A. Peste, M. Kurtz, D. Alistarh, How well do sparse ImageNet models transfer?, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12266–12276.
– year: 2021
  ident: b29
  article-title: Scaling language models: Methods, analysis & insights from training Gopher
– year: 2020
  ident: b263
  article-title: A generalization of transformer networks to graphs
– reference: A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16x16 words: Transformers for image recognition at scale, in: International Conference on Learning Representations, 2020.
– reference: H. Qin, Y. Ding, M. Zhang, Q. Yan, A. Liu, Q. Dang, Z. Liu, X. Liu, BiBERT: Accurate fully binarized BERT, in: ICLR, 2022.
– year: 2016
  ident: b60
  article-title: Gaussian error linear units (gelus)
– volume: 5
  start-page: 1
  year: 2021
  end-page: 22
  ident: b69
  article-title: TPrune: Efficient transformer pruning for mobile devices
  publication-title: ACM Trans. Cyber-Phys. Syst.
– volume: 35
  start-page: 27730
  year: 2022
  end-page: 27744
  ident: b21
  article-title: Training language models to follow instructions with human feedback
  publication-title: Adv. Neural Inf. Process. Syst.
– year: 2022
  ident: b5
  article-title: GenSLMs: Genome-Scale Language Models Reveal SARS-CoV-2 Evolutionary Dynamics
– start-page: 274
  year: 2022
  end-page: 288
  ident: b105
  article-title: An attention-based token pruning method for vision transformers
  publication-title: International Joint Conference on Rough Sets
– year: 2022
  ident: b96
  article-title: Sparse* BERT: Sparse models are robust
– start-page: 7588
  year: 2021
  end-page: 7598
  ident: b224
  article-title: Neural architecture search without training
  publication-title: International Conference on Machine Learning
– reference: M. Javaheripi, S. Shah, S. Mukherjee, T.L. Religa, C.C. Mendes, G.H. de Rosa, S. Bubeck, F. Koushanfar, D. Dey, LiteTransformerSearch: Training-free On-device Search for Efficient Autoregressive Language Models, in: AutoML Workshop, 2022.
– volume: 66
  start-page: 1
  year: 2023
  end-page: 2
  ident: b100
  article-title: A unified pruning framework for vision transformers
  publication-title: Sci. China Inf. Sci.
– reference: A. Fan, E. Grave, A. Joulin, Reducing transformer depth on demand with structured dropout, in: International Conference on Learning Representations, 2019.
– year: 2021
  ident: b230
  article-title: You only compress once: Towards effective and elastic BERT compression via exploit-explore stochastic nature gradient
– reference: L. Lu, Y. Jin, H. Bi, Z. Luo, P. Li, T. Wang, Y. Liang, Sanger: A co-design framework for enabling sparse attention using reconfigurable architecture, in: MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, 2021, pp. 977–991.
– volume: 34
  start-page: 1818
  year: 2021
  end-page: 1830
  ident: b88
  article-title: NxMTransformer: Semi-structured sparsification for natural language understanding via ADMM
  publication-title: Adv. Neural Inf. Process. Syst.
– reference: B. Chen, P. Li, C. Li, B. Li, L. Bai, C. Lin, M. Sun, J. Yan, W. Ouyang, GLiT: Neural architecture search for global and local image transformer, in: IEEE/CVF International Conference on Computer Vision, 2021, pp. 12–21.
– year: 2021
  ident: b86
  article-title: Accelerating sparse deep neural networks
– volume: 10
  start-page: 108374
  year: 2022
  end-page: 108412
  ident: b188
  article-title: Neural architecture search for transformers: A survey
  publication-title: IEEE Access
– reference: A.H. Zadeh, M. Mahmoud, A. Abdelhadi, A. Moshovos, Mokey: Enabling narrow fixed-point inference for out-of-the-box floating-point transformer models, in: Proceedings of the 49th Annual International Symposium on Computer Architecture, 2022, pp. 888–901.
– start-page: 25566
  year: 2022
  end-page: 25580
  ident: b195
  article-title: ShiftAddNAS: Hardware-inspired search for more accurate and efficient neural networks
  publication-title: International Conference on Machine Learning
– reference: G. Shen, J. Zhao, Q. Chen, J. Leng, C. Li, M. Guo, SALO: An efficient spatial accelerator enabling hybrid sparse attention mechanisms for long sequences, in: Design Automation Conference, 2022, pp. 571–576.
– year: 2022
  ident: b208
  article-title: Revisiting architecture-aware knowledge distillation: Smaller models and faster search
– reference: Y. Chen, X. Dai, D. Chen, M. Liu, X. Dong, L. Yuan, Z. Liu, Mobile-Former: Bridging mobilenet and transformer, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5270–5279.
– reference: A. Adhikari, A. Ram, R. Tang, W.L. Hamilton, J. Lin, Exploring the limits of simple learners in knowledge distillation for document classification with DocBERT, in: Proceedings of the 5th Workshop on Representation Learning for NLP, 2020, pp. 72–77.
– reference: M. Nagel, M.v. Baalen, T. Blankevoort, M. Welling, Data-free quantization through weight equalization and bias correction, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1325–1334.
– year: 2022
  ident: b11
  article-title: Improving alignment of dialogue agents via targeted human judgements
– reference: R. Rizk, D. Rizk, F. Rizk, A. Kumar, M. Bayoumi, A Resource-Saving Energy-Efficient Reconfigurable Hardware Accelerator for BERT-based Deep Neural Network Language Models using FFT Multiplication, in: IEEE International Symposium on Circuits and Systems, ISCAS, 2022, pp. 1675–1679.
– reference: A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan, et al., Searching for mobilenetv3, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1314–1324.
– start-page: 169
  year: 2021
  end-page: 182
  ident: b214
  article-title: Autotrans: Automating transformer design via reinforced architecture search
  publication-title: CCF International Conference on Natural Language Processing and Chinese Computing
– reference: B. Zoph, Q.V. Le, Neural architecture search with reinforcement learning, in: International Conference on Learning Representations, 2016.
– reference: F. Yu, K. Huang, M. Wang, Y. Cheng, W. Chu, L. Cui, Width & Depth Pruning for Vision Transformers, in: AAAI Conference on Artificial Intelligence, Vol. 2022, AAAI, 2022.
– volume: 128
  year: 2022
  ident: b238
  article-title: DiVIT: Algorithm and architecture co-design of differential attention in vision transformer
  publication-title: J. Syst. Archit.
– year: 2017
  ident: b2
  article-title: Attention is all you need
  publication-title: Advances in Neural Information Processing Systems, Vol. 30
– year: 2013
  ident: b137
  article-title: Estimating or propagating gradients through stochastic neurons for conditional computation
– year: 2022
  ident: b1
  article-title: Artificial intelligence (AI) market size, growth, Report 2022–2030
– start-page: 33
  year: 2022
  end-page: 49
  ident: b189
  article-title: UniNet: Unified architecture search with convolution, transformer, and MLP
  publication-title: European Conference on Computer Vision
– reference: Q. Chen, C. Sun, Z. Lu, C. Gao, Enabling Energy-Efficient Inference for Self-Attention Mechanisms in Neural Networks, in: IEEE 4th International Conference on Artificial Intelligence Circuits and Systems, AICAS, 2022, pp. 25–28.
– reference: S. Mehta, M. Rastegari, MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer, in: International Conference on Learning Representations, 2021.
– reference: S. Han, H. Mao, W.J. Dally, Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding, in: ICLR, 2016.
– year: 2022
  ident: b151
  article-title: I-ViT: integer-only quantization for efficient vision transformer inference
– volume: 34
  year: 2021
  ident: b202
  article-title: Searching the search space of vision transformer
  publication-title: Adv. Neural Inf. Process. Syst.
– start-page: 191
  year: 2022
  end-page: 207
  ident: b125
  article-title: PTQ4ViT: Post-training quantization for vision transformers with twin uniform quantization
  publication-title: European Conference on Computer Vision
– start-page: 1
  year: 2022
  end-page: 18
  ident: b253
  article-title: PipeBERT: High-throughput BERT inference for ARM big. LITTLE multi-core processors
  publication-title: J. Signal Process. Syst.
– reference: O. Lieber, O. Sharir, B. Lenz, Y. Shoham, Jurassic-1: Technical Details and Evaluation, Vol. 1, White Paper. AI21 Labs, 2021.
– year: 2023
  ident: b154
  article-title: SwiftTron: An efficient hardware accelerator for quantized transformers
– year: 2023
  ident: b37
  article-title: PanGu-
– reference: A. Rock, A. Untether, O. Khalil, O. Shai, P. Grouchy, INT8 Transformers for Inference Acceleration, in: 36th Conference on Neural Information Processing Systems, NeurIPS, 2022.
– year: 2023
  ident: b57
  article-title: Full stack optimization of transformer inference: A survey
– reference: Z. Liu, B. Wu, W. Luo, X. Yang, W. Liu, K.-T. Cheng, Bi-Real Net: Enhancing the Performance of 1-bit CNNs With Improved Representational Capability and Advanced Training Algorithm, in: European Conference on Computer Vision, ECCV, 2018, pp. 722–737.
– reference: E. Voita, D. Talbot, F. Moiseev, R. Sennrich, I. Titov, Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.
– volume: 33
  start-page: 5776
  year: 2020
  end-page: 5788
  ident: b73
  article-title: MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers
  publication-title: Adv. Neural Inf. Process. Syst.
– start-page: 26809
  year: 2022
  end-page: 26823
  ident: b97
  article-title: Platon: Pruning large transformer models with upper confidence bound of weight importance
  publication-title: International Conference on Machine Learning
– year: 2022
  ident: b16
  article-title: Lamda: Language models for dialog applications
– year: 2022
  ident: b31
  article-title: Large language models encode clinical knowledge