SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning
Published in | Proceedings - International Symposium on High-Performance Computer Architecture pp. 97 - 110 |
Main Authors | Wang, Hanrui; Zhang, Zhekai; Han, Song |
Format | Conference Proceeding |
Language | English |
Published | IEEE, 01.02.2021 |
Subjects | |
Abstract | The attention mechanism is becoming increasingly popular in Natural Language Processing (NLP) applications, showing superior performance to convolutional and recurrent architectures. However, general-purpose platforms such as CPUs and GPUs are inefficient when performing attention inference due to complicated data movement and low arithmetic intensity. Moreover, existing NN accelerators mainly focus on optimizing convolutional or recurrent models, and cannot efficiently support attention. In this paper, we present SpAtten, an efficient algorithm-architecture co-design that leverages token sparsity, head sparsity, and quantization opportunities to reduce the attention computation and memory access. Inspired by the high redundancy of human languages, we propose the novel cascade token pruning to prune away unimportant tokens in the sentence. We also propose cascade head pruning to remove unessential heads. Cascade pruning is fundamentally different from weight pruning since there is no trainable weight in the attention mechanism, and the pruned tokens and heads are selected on the fly. To efficiently support them on hardware, we design a novel top-k engine to rank token and head importance scores with high throughput. Furthermore, we propose progressive quantization that first fetches MSBs only and performs the computation; if the confidence is low, it fetches LSBs and recomputes the attention outputs, trading computation for memory reduction. Extensive experiments on 30 benchmarks show that, on average, SpAtten reduces DRAM access by 10.0× with no accuracy loss, and achieves 1.6×, 3.0×, 162×, and 347× speedup, and 1.4×, 3.2×, 1193×, and 4059× energy savings over the A3 accelerator, MNNFast accelerator, TITAN Xp GPU, and Xeon CPU, respectively. |
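The two mechanisms named in the abstract, cascade token pruning driven by accumulated attention importance with on-the-fly top-k selection, and progressive quantization that fetches MSBs first and falls back to LSBs only when the softmax is not confident, can be illustrated in software. The sketch below is a minimal NumPy illustration under assumed shapes and thresholds; the function names, `keep_ratio`, `lsb_bits`, `conf_threshold`, and the `fetch_lsb` callback are hypothetical and only model the described dataflow, not SpAtten's hardware interface.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_cascade_token_pruning(Q, K, V, cum_importance=None, keep_ratio=0.5):
    """Cascade token pruning (illustrative): tokens that receive little attention
    probability, accumulated over heads and layers, are pruned on the fly and
    never processed (or fetched) again by later layers."""
    num_heads, seq_len, head_dim = Q.shape
    probs = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(head_dim))    # (H, L, L)

    # Token importance = total attention probability the token receives,
    # summed over all heads and query positions, accumulated across layers.
    importance = probs.sum(axis=(0, 1))                              # (L,)
    cum_importance = importance if cum_importance is None else cum_importance + importance

    # Top-k selection keeps the highest-importance tokens; the rest are gone
    # for this layer and every layer after it ("cascade").
    k = max(1, int(round(seq_len * keep_ratio)))
    keep = np.sort(np.argsort(cum_importance)[-k:])

    out = probs @ V                                                  # (H, L, head_dim)
    return out[:, keep, :], keep, cum_importance[keep]

def progressive_attention_probs(q_msb, k_msb, fetch_lsb, lsb_bits=4,
                                scale=1.0, conf_threshold=0.4):
    """Progressive quantization (illustrative): compute attention probabilities
    from the MSBs only; if the softmax is not confident enough, fetch the LSBs
    (an extra DRAM access, modeled by `fetch_lsb`) and recompute."""
    probs = softmax(((q_msb << lsb_bits) @ (k_msb << lsb_bits).T) * scale)
    if probs.max(axis=-1).mean() >= conf_threshold:
        return probs                     # MSBs sufficed; LSBs never leave DRAM
    q_lsb, k_lsb = fetch_lsb()           # trade extra computation for memory savings
    q_full = (q_msb << lsb_bits) + q_lsb
    k_full = (k_msb << lsb_bits) + k_lsb
    return softmax((q_full @ k_full.T) * scale)
```

In the accelerator itself, the top-k ranking is done by a dedicated high-throughput top-k engine, and each layer is called with only the surviving tokens' Q/K/V plus the returned cumulative importance, which is where the reported DRAM-access reduction comes from.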
Author | Han, Song; Wang, Hanrui; Zhang, Zhekai |
Author_xml | – sequence: 1 givenname: Hanrui surname: Wang fullname: Wang, Hanrui email: hanrui@mit.edu organization: Massachusetts Institute of Technology, EECS, Cambridge, MA, US
– sequence: 2 givenname: Zhekai surname: Zhang fullname: Zhang, Zhekai email: zhangzk@mit.edu organization: Massachusetts Institute of Technology, EECS, Cambridge, MA, US
– sequence: 3 givenname: Song surname: Han fullname: Han, Song email: songhan@mit.edu organization: Massachusetts Institute of Technology, EECS, Cambridge, MA, US |
CODEN | IEEPAD |
ContentType | Conference Proceeding |
DOI | 10.1109/HPCA51647.2021.00018 |
Discipline | Computer Science |
EISBN | 1665422351; 9781665422352 |
EISSN | 2378-203X |
EndPage | 110 |
ExternalDocumentID | 9407232 |
Genre | orig-research |
IsPeerReviewed | false |
IsScholarly | true |
Language | English |
PageCount | 14 |
ParticipantIDs | ieee_primary_9407232 |
PublicationDate | 2021-Feb. |
PublicationTitle | Proceedings - International Symposium on High-Performance Computer Architecture |
PublicationTitleAbbrev | HPCA |
PublicationYear | 2021 |
Publisher | IEEE |
SourceID | ieee |
SourceType | Publisher |
StartPage | 97 |
SubjectTerms | Algorithm-Architecture Co-design; Attention; Domain-Specific Accelerator; Memory management; Natural language processing; Pruning; Quantization; Quantization (signal); Random access memory; Redundancy; Space exploration; Throughput |
Title | SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning |
URI | https://ieeexplore.ieee.org/document/9407232 |