Demystifying BERT: System Design Implications

Bibliographic Details
Published in	2022 IEEE International Symposium on Workload Characterization (IISWC), pp. 296-309
Main Authors	Pati, Suchita; Aga, Shaizeen; Jayasena, Nuwan; Sinclair, Matthew D.
Format	Conference Proceeding
Language	English
Published	IEEE 01.11.2022
Subjects	Accelerator design; Bit error rate; Characterization; Computational modeling; Deep Learning; Near memory Computing; Propulsion; Technological innovation; Training; Transfer learning; Transformers
Summary:	Transfer learning in natural language processing (NLP) uses increasingly large models that tackle challenging problems. Consequently, these applications are driving the requirements of future systems. To this end, we study the compute- and time-intensive training phase of NLP models and identify how its algorithmic behavior can guide future accelerator design. We focus on BERT (Bidirectional Encoder Representations from Transformers), one of the most popular Transformer-based NLP models, and identify key operations that merit attention in accelerator design. In particular, we focus on the manifestation, size, and arithmetic behavior of these operations, which remain constant irrespective of hardware choice. Our results show that although computations that manifest as matrix multiplications dominate BERT's execution, they have considerable heterogeneity. Furthermore, we characterize memory-intensive computations, which also feature prominently in BERT but have received less attention. To capture future Transformer trends, we also show and discuss the implications of these behaviors as networks get larger. Moreover, we study the impact of key training techniques such as distributed training, checkpointing, and mixed-precision training. Finally, our analysis identifies holistic solutions to optimize systems for BERT-like models, and we further demonstrate how enhancing compute-intensive accelerators with near-memory compute can help accelerate Transformer networks.
DOI:	10.1109/IISWC55918.2022.00033