SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning
Published in | Proceedings - International Symposium on High-Performance Computer Architecture pp. 97 - 110 |
Main Authors | Wang, Hanrui; Zhang, Zhekai; Han, Song |
Format | Conference Proceeding |
Language | English |
Published | IEEE, 01.02.2021 |
Subjects | |
Abstract | The attention mechanism is becoming increasingly popular in Natural Language Processing (NLP) applications, showing superior performance to convolutional and recurrent architectures. However, general-purpose platforms such as CPUs and GPUs are inefficient when performing attention inference due to complicated data movement and low arithmetic intensity. Moreover, existing NN accelerators mainly focus on optimizing convolutional or recurrent models, and cannot efficiently support attention. In this paper, we present SpAtten, an efficient algorithm-architecture co-design that leverages token sparsity, head sparsity, and quantization opportunities to reduce the attention computation and memory access. Inspired by the high redundancy of human languages, we propose the novel cascade token pruning to prune away unimportant tokens in the sentence. We also propose cascade head pruning to remove unessential heads. Cascade pruning is fundamentally different from weight pruning since there is no trainable weight in the attention mechanism, and the pruned tokens and heads are selected on the fly. To efficiently support them on hardware, we design a novel top-k engine to rank token and head importance scores with high throughput. Furthermore, we propose progressive quantization that first fetches MSBs only and performs the computation; if the confidence is low, it fetches LSBs and recomputes the attention outputs, trading computation for memory reduction. Extensive experiments on 30 benchmarks show that, on average, SpAtten reduces DRAM access by 10.0× with no accuracy loss, and achieves 1.6×, 3.0×, 162×, and 347× speedup, and 1.4×, 3.2×, 1193×, and 4059× energy savings over the A3 accelerator, MNNFast accelerator, TITAN Xp GPU, and Xeon CPU, respectively. |
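The two mechanisms named in the abstract, cascade token pruning driven by accumulated attention importance with on-the-fly top-k selection, and progressive quantization that fetches MSBs first and falls back to LSBs only when the softmax is not confident, can be illustrated in software. The sketch below is a minimal NumPy illustration under assumed shapes and thresholds; the function names, `keep_ratio`, `lsb_bits`, `conf_threshold`, and the `fetch_lsb` callback are hypothetical and only model the described dataflow, not SpAtten's hardware interface.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_cascade_token_pruning(Q, K, V, cum_importance=None, keep_ratio=0.5):
    """Cascade token pruning (illustrative): tokens that receive little attention
    probability, accumulated over heads and layers, are pruned on the fly and
    never processed (or fetched) again by later layers."""
    num_heads, seq_len, head_dim = Q.shape
    probs = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(head_dim))    # (H, L, L)

    # Token importance = total attention probability the token receives,
    # summed over all heads and query positions, accumulated across layers.
    importance = probs.sum(axis=(0, 1))                              # (L,)
    cum_importance = importance if cum_importance is None else cum_importance + importance

    # Top-k selection keeps the highest-importance tokens; the rest are gone
    # for this layer and every layer after it ("cascade").
    k = max(1, int(round(seq_len * keep_ratio)))
    keep = np.sort(np.argsort(cum_importance)[-k:])

    out = probs @ V                                                  # (H, L, head_dim)
    return out[:, keep, :], keep, cum_importance[keep]

def progressive_attention_probs(q_msb, k_msb, fetch_lsb, lsb_bits=4,
                                scale=1.0, conf_threshold=0.4):
    """Progressive quantization (illustrative): compute attention probabilities
    from the MSBs only; if the softmax is not confident enough, fetch the LSBs
    (an extra DRAM access, modeled by `fetch_lsb`) and recompute."""
    probs = softmax(((q_msb << lsb_bits) @ (k_msb << lsb_bits).T) * scale)
    if probs.max(axis=-1).mean() >= conf_threshold:
        return probs                     # MSBs sufficed; LSBs never leave DRAM
    q_lsb, k_lsb = fetch_lsb()           # trade extra computation for memory savings
    q_full = (q_msb << lsb_bits) + q_lsb
    k_full = (k_msb << lsb_bits) + k_lsb
    return softmax((q_full @ k_full.T) * scale)
```

In the accelerator itself, the top-k ranking is done by a dedicated high-throughput top-k engine, and each layer is called with only the surviving tokens' Q/K/V plus the returned cumulative importance, which is where the reported DRAM-access reduction comes from.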
Author | Han, Song; Wang, Hanrui; Zhang, Zhekai |
Author_xml | – sequence: 1 givenname: Hanrui surname: Wang fullname: Wang, Hanrui email: hanrui@mit.edu organization: Massachusetts Institute of Technology, EECS, Cambridge, MA, US
– sequence: 2 givenname: Zhekai surname: Zhang fullname: Zhang, Zhekai email: zhangzk@mit.edu organization: Massachusetts Institute of Technology, EECS, Cambridge, MA, US
– sequence: 3 givenname: Song surname: Han fullname: Han, Song email: songhan@mit.edu organization: Massachusetts Institute of Technology, EECS, Cambridge, MA, US |
CODEN | IEEPAD |
ContentType | Conference Proceeding |
DOI | 10.1109/HPCA51647.2021.00018 |
Discipline | Computer Science |
EISBN | 1665422351; 9781665422352 |
EISSN | 2378-203X |
EndPage | 110 |
ExternalDocumentID | 9407232 |
Genre | orig-research |
IsPeerReviewed | false |
IsScholarly | true |
Language | English |
PageCount | 14 |
ParticipantIDs | ieee_primary_9407232 |
PublicationDate | 2021-Feb. |
PublicationTitle | Proceedings - International Symposium on High-Performance Computer Architecture |
PublicationTitleAbbrev | HPCA |
PublicationYear | 2021 |
Publisher | IEEE |
SourceID | ieee |
SourceType | Publisher |
StartPage | 97 |
SubjectTerms | Algorithm-Architecture Co-design; Attention; Domain-Specific Accelerator; Memory management; Natural language processing; Pruning; Quantization; Quantization (signal); Random access memory; Redundancy; Space exploration; Throughput |
Title | SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning |
URI | https://ieeexplore.ieee.org/document/9407232 |