Gsyn: Reducing Staleness and Communication Waiting via Grouping-based Synchronization for Distributed Deep Learning


Bibliographic Details
Published in: IEEE INFOCOM 2024 - IEEE Conference on Computer Communications, pp. 1731-1740
Main Authors: Li, Yijun; Huang, Jiawei; Li, Zhaoyi; Liu, Jingling; Zhou, Shengwen; Jiang, Wanchun; Wang, Jianxin
Format: Conference Proceeding
Language: English
Published: IEEE, 20.05.2024

Summary: Distributed deep learning is widely employed to train deep neural networks over large-scale datasets. However, the commonly used parameter server architecture suffers from long synchronization times in data-parallel training. Although existing solutions reduce synchronization overhead by breaking synchronization barriers or bounding staleness, they inevitably suffer from low convergence efficiency or long synchronization waiting. To address these problems, we propose Gsyn, which reduces both synchronization overhead and staleness. Specifically, Gsyn divides workers into multiple groups. Workers in the same group coordinate with each other using the bulk synchronous parallel (BSP) scheme to achieve high convergence efficiency, and each group communicates with the parameter server asynchronously to reduce synchronization waiting time. Furthermore, we theoretically analyze the optimal number of groups to achieve a good tradeoff between staleness and synchronization waiting. Evaluation on a realistic cluster with multiple training tasks demonstrates that Gsyn is beneficial and accelerates distributed training by up to 27% over state-of-the-art solutions.
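The hybrid scheme described in the summary can be sketched in a few lines: workers within a group synchronize at a barrier and average their gradients (BSP), while each group pushes its averaged update to the parameter server without waiting for other groups (asynchronous). This is a minimal illustrative simulation, not the paper's implementation; all names, the learning rate, and the gradient values are assumptions.

```python
import threading

class ParameterServer:
    """Toy parameter server holding one scalar parameter (illustrative only)."""
    def __init__(self, init_param=0.0):
        self.param = init_param
        self.lock = threading.Lock()

    def apply(self, update):
        # Each group pushes asynchronously; only a brief lock guards the apply.
        with self.lock:
            self.param += update

def run_group(ps, grads_per_step, lr=0.1):
    # Intra-group BSP: at every step, all workers' gradients in this group
    # are averaged (the barrier) before a single update is pushed.
    for step_grads in grads_per_step:
        avg = sum(step_grads) / len(step_grads)  # group-level gradient average
        ps.apply(-lr * avg)                      # async push to the server

ps = ParameterServer()
# Two groups of two workers, two steps each; gradient values are made up.
groups = [
    [[1.0, 3.0], [2.0, 2.0]],   # group 0: per-step worker gradients
    [[4.0, 0.0], [1.0, 1.0]],   # group 1
]
threads = [threading.Thread(target=run_group, args=(ps, g)) for g in groups]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(round(ps.param, 6))  # → -0.7
```

With one group the scheme degenerates to pure BSP (maximum waiting, zero staleness); with one worker per group it degenerates to fully asynchronous training (zero waiting, maximum staleness). The group count analyzed in the paper interpolates between these extremes.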
ISSN: 2641-9874
DOI: 10.1109/INFOCOM52122.2024.10621250