ITA: An Energy-Efficient Attention and Softmax Accelerator for Quantized Transformers

Bibliographic Details
Published in: arXiv.org
Main Authors: İslamoğlu, Gamze; Scherer, Moritz; Paulin, Gianna; Fischer, Tim; Jung, Victor J. B.; Garofalo, Angelo; Benini, Luca
Format: Paper / Journal Article
Language: English
Published: Ithaca: Cornell University Library, arXiv.org, 10.07.2023

Summary: Transformer networks have emerged as the state-of-the-art approach for natural language processing tasks and are gaining popularity in other domains such as computer vision and audio processing. However, the efficient hardware acceleration of transformer models poses new challenges due to their high arithmetic intensities, large memory requirements, and complex dataflow dependencies. In this work, we propose ITA, a novel accelerator architecture for transformers and related models that targets efficient inference on embedded systems by exploiting 8-bit quantization and an innovative softmax implementation that operates exclusively on integer values. By computing on-the-fly in streaming mode, our softmax implementation minimizes data movement and energy consumption. ITA achieves competitive energy efficiency with respect to state-of-the-art transformer accelerators with 16.9 TOPS/W, while outperforming them in area efficiency with 5.93 TOPS/mm² in 22 nm fully-depleted silicon-on-insulator technology at 0.8 V.
ISSN: 2331-8422
DOI: 10.48550/arxiv.2307.03493
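
Note: The summary mentions an integer-only, streaming softmax but gives no formula. The sketch below is only an illustration of what such an operation can look like, not ITA's published algorithm; it substitutes a base-2 approximation for the exponential, and the function name integer_softmax, the 16-bit fixed-point constant, and the 8-bit output scale are assumptions made for this example.

    import numpy as np

    def integer_softmax(logits_q8: np.ndarray, out_bits: int = 8) -> np.ndarray:
        """Illustrative integer-only softmax over int8 logits (not ITA's exact scheme).

        Replaces exp(x - max) with the power of two 2**(x - max), so every step
        is a shift, add, or integer divide; no floating point is used.
        """
        x = logits_q8.astype(np.int32)
        m = int(x.max())                    # a streaming design would track this on the fly
        d = np.minimum(m - x, 16)           # non-negative exponents, saturated at 16
        num = np.right_shift(1 << 16, d)    # 2**(-d) with 16 fractional bits
        den = int(num.sum())                # integer normalization constant
        scale = (1 << out_bits) - 1         # map probabilities onto 0 .. 2**out_bits - 1
        probs = (num.astype(np.int64) * scale + den // 2) // den   # rounded integer divide
        return probs.astype(np.uint8)

    # Example: one row of quantized attention scores
    scores = np.array([-12, 3, 40, 40, -100], dtype=np.int8)
    print(integer_softmax(scores))          # the two dominant entries share most of the 0..255 mass

Keeping the numerator in fixed point and normalizing with a single integer divide mirrors the general idea of avoiding floating-point exponentials in quantized attention; the actual ITA dataflow and approximation are described in the paper itself.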