A Framework for Accelerating Transformer-Based Language Model on ReRAM-Based Architecture


Bibliographic Details
Published in: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 41, No. 9, pp. 3026-3039
Main Authors: Kang, Myeonggu; Shin, Hyein; Kim, Lee-Sup
Format: Journal Article
Language: English
Published: New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.09.2022
Summary: Transformer-based language models have become the de-facto standard for various natural language processing (NLP) applications given their superior algorithmic performance. Processing a transformer-based language model on a conventional accelerator induces the memory wall problem, and the ReRAM-based accelerator is a promising solution to this problem. However, due to the characteristics of the self-attention mechanism and the ReRAM-based accelerator, a pipeline hazard arises when processing the transformer-based language model on the ReRAM-based accelerator. This hazard greatly increases the overall execution time. In this article, we propose a framework to resolve the hazard. First, we propose the concept of window self-attention to reduce the attention computation scope by analyzing the properties of the self-attention mechanism. We then present a window-size search algorithm, which finds an optimal window-size set according to the target application/algorithmic performance. We also suggest a hardware design that exploits the advantages of the proposed algorithm optimization on a general ReRAM-based accelerator. The proposed work successfully alleviates the hazard while maintaining the algorithmic performance, leading to a $5.8\times$ speedup over the provisioned baseline. It also delivers up to $39.2\times$/$643.2\times$ speedup/higher energy efficiency over a GPU, respectively.
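The summary's "window self-attention" refers to restricting each token's attention to a local neighborhood instead of the full sequence. The sketch below illustrates that general idea as causal sliding-window attention in NumPy; it is not the paper's actual algorithm, and the function name and `window` parameter are assumptions made for illustration.

```python
import numpy as np

def window_attention(Q, K, V, window):
    """Scaled dot-product attention restricted to a sliding window.

    Q, K, V: (seq_len, d) arrays. `window` is the number of past tokens
    (including the current one) each query position may attend to.
    Illustrative only; the paper's window-size search is not reproduced here.
    """
    seq_len, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                      # (seq_len, seq_len)
    # Mask keys outside [i - window + 1, i] for query position i (causal window).
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    mask = (j > i) | (j < i - window + 1)
    scores = np.where(mask, -np.inf, scores)
    # Numerically stable softmax over the unmasked keys.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                  # (seq_len, d)

# Example: 8 tokens, model dimension 4, window of 3 past tokens.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 4)) for _ in range(3))
out = window_attention(Q, K, V, window=3)
print(out.shape)  # (8, 4)
```

Shrinking the attention scope this way reduces the number of key/value positions each query must process, which is the property the paper's framework exploits to mitigate the pipeline hazard on the ReRAM-based accelerator.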
ISSN: 0278-0070; 1937-4151
DOI: 10.1109/TCAD.2021.3121264