Gsyn: Reducing Staleness and Communication Waiting via Grouping-based Synchronization for Distributed Deep Learning
Published in: IEEE INFOCOM 2024 - IEEE Conference on Computer Communications, pp. 1731-1740
Format: Conference Proceeding
Language: English
Published: IEEE, 20.05.2024
Summary: Distributed deep learning has been widely employed to train deep neural networks over large-scale datasets. However, the commonly used parameter server architecture suffers from long synchronization time in data-parallel training. Although existing solutions reduce synchronization overhead by breaking the synchronization barriers or limiting the staleness bound, they inevitably experience low convergence efficiency and long synchronization waiting. To address these problems, we propose Gsyn to reduce both synchronization overhead and staleness. Specifically, Gsyn divides workers into multiple groups. Workers in the same group coordinate with each other using the bulk synchronous parallel (BSP) scheme to achieve high convergence efficiency, and each group communicates with the parameter server asynchronously to reduce synchronization waiting time, consequently increasing convergence efficiency. Furthermore, we theoretically analyze the optimal number of groups to achieve a good tradeoff between staleness and synchronization waiting. Evaluation on a realistic cluster with multiple training tasks demonstrates that Gsyn is effective and accelerates distributed training by up to 27% over state-of-the-art solutions.
ISSN: 2641-9874
DOI: 10.1109/INFOCOM52122.2024.10621250
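
To make the grouping scheme described in the abstract concrete, below is a minimal sketch, not the authors' implementation: it assumes a toy quadratic model, Python threads standing in for distributed workers, and a hypothetical in-process `ParameterServer` class. Workers inside a group synchronize with a barrier (BSP), and each group pushes its averaged update to the server without waiting for the other groups.

```python
# Illustrative sketch of grouping-based synchronization (not the Gsyn implementation).
# Assumptions: toy quadratic loss, simulated gradients, threads standing in for workers.
# Within a group: bulk synchronous parallel (BSP) steps via a barrier.
# Across groups: asynchronous pushes to the parameter server (no global barrier).

import threading

import numpy as np

DIM, STEPS, GROUPS, WORKERS_PER_GROUP, LR = 4, 20, 2, 3, 0.1
TARGET = np.ones(DIM)  # optimum of the toy loss 0.5 * ||w - TARGET||^2


class ParameterServer:
    """Holds the global model; groups read and update it asynchronously."""

    def __init__(self, dim):
        self.weights = np.zeros(dim)
        self.lock = threading.Lock()

    def pull(self):
        with self.lock:
            return self.weights.copy()

    def push(self, update):
        with self.lock:  # short critical section only, no waiting across groups
            self.weights -= LR * update


def run_group(gid, ps):
    barrier = threading.Barrier(WORKERS_PER_GROUP)  # intra-group BSP barrier
    grads = [None] * WORKERS_PER_GROUP

    def worker(wid):
        rng = np.random.default_rng(100 * gid + wid)
        for _ in range(STEPS):
            weights = ps.pull()  # may already include other groups' updates (staleness)
            # Simulated noisy gradient of the toy loss on this worker's "mini-batch".
            grads[wid] = (weights - TARGET) + 0.01 * rng.standard_normal(DIM)
            barrier.wait()       # BSP: wait until every group member has its gradient
            if wid == 0:         # one member aggregates and talks to the server
                ps.push(np.mean(grads, axis=0))
            barrier.wait()       # keep the group in lockstep before the next step

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(WORKERS_PER_GROUP)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


if __name__ == "__main__":
    ps = ParameterServer(DIM)
    groups = [threading.Thread(target=run_group, args=(g, ps)) for g in range(GROUPS)]
    for g in groups:
        g.start()
    for g in groups:
        g.join()
    print("final distance to optimum:", np.linalg.norm(ps.weights - TARGET))
```

In this sketch the number of groups (`GROUPS`) controls the tradeoff the abstract mentions: fewer, larger groups behave more like pure BSP (less staleness, more waiting), while many small groups behave more like fully asynchronous training (less waiting, more staleness).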