SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning

Bibliographic Details
Published in Proceedings - International Symposium on High-Performance Computer Architecture, pp. 97-110
Main Authors Wang, Hanrui; Zhang, Zhekai; Han, Song
Format Conference Proceeding
Language English
Published IEEE 01.02.2021
Abstract The attention mechanism is becoming increasingly popular in Natural Language Processing (NLP) applications, showing superior performance over convolutional and recurrent architectures. However, general-purpose platforms such as CPUs and GPUs are inefficient when performing attention inference due to complicated data movement and low arithmetic intensity. Moreover, existing NN accelerators mainly focus on optimizing convolutional or recurrent models, and cannot efficiently support attention. In this paper, we present SpAtten, an efficient algorithm-architecture co-design that leverages token sparsity, head sparsity, and quantization opportunities to reduce attention computation and memory access. Inspired by the high redundancy of human languages, we propose novel cascade token pruning to prune away unimportant tokens in the sentence. We also propose cascade head pruning to remove unessential heads. Cascade pruning is fundamentally different from weight pruning, since there are no trainable weights in the attention mechanism and the pruned tokens and heads are selected on the fly. To efficiently support them in hardware, we design a novel top-k engine that ranks token and head importance scores with high throughput. Furthermore, we propose progressive quantization, which first fetches only the MSBs and performs the computation; if the confidence is low, it fetches the LSBs and recomputes the attention outputs, trading computation for memory reduction. Extensive experiments on 30 benchmarks show that, on average, SpAtten reduces DRAM access by 10.0× with no accuracy loss, and achieves 1.6×, 3.0×, 162×, and 347× speedup, and 1.4×, 3.2×, 1193×, and 4059× energy savings over the A³ accelerator, the MNNFast accelerator, a TITAN Xp GPU, and a Xeon CPU, respectively.
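To make the cascade token pruning described in the abstract concrete, the following is a minimal NumPy sketch of the idea, not the paper's actual hardware design: it accumulates a per-token importance score from the attention probabilities each token receives, ranks tokens with a software top-k (standing in for SpAtten's dedicated top-k engine), and keeps only the highest-scoring tokens for the layers that follow. The function names, tensor shapes, and the keep_ratio parameter are illustrative assumptions rather than anything specified in the record.

```python
# Minimal sketch of cascade token pruning (illustrative only; SpAtten itself
# implements the ranking and pruning in dedicated hardware).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_token_pruning(Q, K, V, cumulative_importance, keep_ratio=0.5):
    """One attention layer that updates per-token importance scores and
    selects the tokens to keep for subsequent layers (cascade pruning).

    Q, K, V: (heads, tokens, d_head) arrays for this layer.
    cumulative_importance: (tokens,) scores carried over from earlier layers.
    keep_ratio: assumed pruning knob; fraction of tokens to retain.
    """
    heads, tokens, d = Q.shape
    probs = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d))  # (heads, tokens, tokens)
    out = probs @ V                                          # attention output

    # Importance of a token = total attention probability it receives,
    # summed over heads and query positions, accumulated across layers.
    cumulative_importance = cumulative_importance + probs.sum(axis=(0, 1))

    # Top-k selection: keep the most important tokens; pruned tokens are
    # dropped for all later layers and never revisited (no retraining needed,
    # since there are no trainable weights in the attention mechanism itself).
    k = max(1, int(keep_ratio * tokens))
    kept = np.sort(np.argsort(cumulative_importance)[-k:])
    return out, cumulative_importance[kept], kept
```

A caller would use the returned indices to slice the hidden states (and hence Q, K, V) before the next layer; the cascade head pruning mentioned in the abstract applies the same cumulative-scoring idea at head granularity, and the hardware top-k engine plays the role of the argsort used here.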
Author Han, Song
Wang, Hanrui
Zhang, Zhekai
Author_xml – sequence: 1
  givenname: Hanrui
  surname: Wang
  fullname: Wang, Hanrui
  email: hanrui@mit.edu
  organization: Massachusetts Institute of Technology, EECS, Cambridge, MA, US
– sequence: 2
  givenname: Zhekai
  surname: Zhang
  fullname: Zhang, Zhekai
  email: zhangzk@mit.edu
  organization: Massachusetts Institute of Technology, EECS, Cambridge, MA, US
– sequence: 3
  givenname: Song
  surname: Han
  fullname: Han, Song
  email: songhan@mit.edu
  organization: Massachusetts Institute of Technology, EECS, Cambridge, MA, US
CODEN IEEPAD
ContentType Conference Proceeding
DOI 10.1109/HPCA51647.2021.00018
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP All) 1998-Present
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
Discipline Computer Science
EISBN 1665422351
9781665422352
EISSN 2378-203X
EndPage 110
ExternalDocumentID 9407232
Genre orig-research
IsPeerReviewed false
IsScholarly true
Language English
LinkModel DirectLink
PageCount 14
ParticipantIDs ieee_primary_9407232
PublicationCentury 2000
PublicationDate 2021-Feb.
PublicationDateYYYYMMDD 2021-02-01
PublicationDate_xml – month: 02
  year: 2021
  text: 2021-Feb.
PublicationDecade 2020
PublicationTitle Proceedings - International Symposium on High-Performance Computer Architecture
PublicationTitleAbbrev HPCA
PublicationYear 2021
Publisher IEEE
Publisher_xml – name: IEEE
SSID ssj0002951
SourceID ieee
SourceType Publisher
StartPage 97
SubjectTerms Algorithm-Architecture Co-design
Attention
Domain-Specific Accelerator
Memory management
Natural language processing
Pruning
Quantization
Quantization (signal)
Random access memory
Redundancy
Space exploration
Throughput
Title SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning
URI https://ieeexplore.ieee.org/document/9407232