Breadth-First Pipeline Parallelism
Main Author | |
---|---|
Format | Journal Article |
Language | English |
Published | 10.11.2022 |
Subjects | |
Summary: | We introduce Breadth-First Pipeline Parallelism, a novel training schedule which optimizes the combination of pipeline and data parallelism. Breadth-First Pipeline Parallelism lowers training time, cost and memory usage by combining a high GPU utilization with a small batch size per GPU, and by making use of fully sharded data parallelism. Experimentally, we observed an increase of up to 43% in training throughput for a 52 billion-parameter model using a small batch size per GPU compared to Megatron-LM, which would reduce the training time and cost by the same amount on a large GPU cluster. |
---|---|
DOI: | 10.48550/arxiv.2211.05953 |