Throughput-oriented and Accuracy-aware DNN Training with BFloat16 on GPU

Bibliographic Details
Published in: 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 1084-1087
Main Authors: Xie, Zhen; Raskar, Siddhisanket; Emani, Murali
Format: Conference Proceeding
Language: English
Published: IEEE, 01.05.2022

Summary: Deep Neural Networks (DNNs) have transformed the field of artificial intelligence and achieved extraordinary success in many areas. Training DNNs is commonly compute- and memory-intensive, which has motivated several optimizations of the training phase. Among them, reduced precision is a typical and widely used technique to accelerate DNN training and reduce memory requirements. However, applying a widely adopted reduced precision format such as Float16 to all operations in DNN training is not optimal, as using Float16 in some operations hurts model accuracy. Meanwhile, additional optimizations such as loss scaling and autocast can mitigate the accuracy loss, but they introduce inherent overhead and leave reduced precision underused. In this work, we leverage another reduced precision format, BFloat16, and introduce a throughput-oriented and accuracy-aware approach to maximize the performance potential of DNN training. Since the high throughput of the BFloat16 format comes at the cost of a lower-precision floating-point representation, this approach achieves high throughput by using BFloat16 for all DNN operations and avoids the accuracy loss through a customized accuracy-aware normalization. Results show that our approach outperforms state-of-the-art mixed precision training by 1.21x on an NVIDIA A100 GPU.
ISBN: 9781665497480
DOI: 10.1109/IPDPSW55747.2022.00176
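
To make the contrast in the summary concrete, the sketch below (a hedged illustration, not the paper's implementation) shows the conventional Float16 mixed-precision recipe with autocast and dynamic loss scaling next to an all-BFloat16 training step, assuming PyTorch on a CUDA GPU (it falls back to FP32 on CPU). The paper's customized accuracy-aware normalization is a separate component and is not reproduced here.

# Hedged sketch, not the paper's implementation: contrasts the common
# FP16 mixed-precision recipe (autocast + dynamic loss scaling) with an
# "all-BFloat16" training step. Assumes PyTorch; runs in FP32 on CPU if
# no CUDA device is available.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
x = torch.randn(64, 512, device=device)
y = torch.randint(0, 10, (64,), device=device)

# (a) Baseline: FP16 autocast plus GradScaler. FP16's narrow exponent range
# makes gradients prone to underflow, so a loss-scaling step is needed.
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))
with torch.autocast(device_type=device, dtype=torch.float16, enabled=(device == "cuda")):
    loss = loss_fn(model(x), y)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad(set_to_none=True)

# (b) All-BFloat16: BF16 keeps FP32's 8-bit exponent, so loss scaling is
# typically unnecessary and every operator can run in reduced precision.
model = model.to(torch.bfloat16)
loss = loss_fn(model(x.to(torch.bfloat16)), y)
loss.backward()
optimizer.step()

The trade-off is BFloat16's shorter mantissa (7 bits versus Float16's 10), which is why the paper pairs all-BFloat16 execution with an accuracy-aware normalization rather than relying on loss scaling.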