Breadth-First Pipeline Parallelism

Bibliographic Details
Main Author	Lamy-Poirier, Joel
Format	Journal Article
Language	English
Published	10 November 2022
Subjects	Computer Science - Artificial Intelligence; Computer Science - Computation and Language; Computer Science - Distributed, Parallel, and Cluster Computing; Computer Science - Learning
Summary:	We introduce Breadth-First Pipeline Parallelism, a novel training schedule which optimizes the combination of pipeline and data parallelism. Breadth-First Pipeline Parallelism lowers training time, cost and memory usage by combining a high GPU utilization with a small batch size per GPU, and by making use of fully sharded data parallelism. Experimentally, we observed an increase of up to 43% in training throughput for a 52 billion-parameter model using a small batch size per GPU compared to Megatron-LM, which would reduce the training time and cost by the same amount on a large GPU cluster.
DOI:	10.48550/arxiv.2211.05953