EP4DDL: addressing straggler problem in heterogeneous distributed deep learning

Bibliographic Details
Published in: The Journal of Supercomputing, Vol. 78, No. 13, pp. 15663-15680
Main Authors: Ji, Zeyu; Zhang, Xingjun; Li, Jingbo; Wei, Jia; Wei, Zheng
Format: Journal Article
Language: English
Published: New York: Springer US (Springer Nature B.V.), 01.09.2022

Summary: Driven by big data, neural networks have grown increasingly complex, and the computing capacity of a single machine often cannot meet the demand. Distributed deep learning has shown great performance advantages in handling this problem. A serious issue in this field, however, is the existence of stragglers, which significantly restricts the performance of the whole system. Fully exploiting the computing capacity of a system built on the parameter-server architecture is an enormous challenge, especially in a heterogeneous environment. Motivated by this, we designed a method named EP4DDL that minimizes the impact of the straggler problem through load balancing. From a statistical viewpoint, the approach introduces a novel metric, performance variance, to give a comprehensive view of stragglers, and employs flexible parallelism for each node. We verify the algorithm on standard benchmarks and show that it reduces training time by 57.46%, 24.8%, and 11.5% compared with FlexRR, Con-SGD, and Falcon, respectively, without loss of accuracy.
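
The abstract gives only the high-level idea behind EP4DDL's load balancing; the paper's actual algorithm is not reproduced here. As a rough illustration under stated assumptions, the Python sketch below shows one way the ingredients it names could fit together: measure per-node throughput, treat the normalized variance of those measurements as a straggler signal, and, when it fires, reassign each node's share of the global batch in proportion to its measured speed. Every name here (rebalance, throughputs, variance_threshold) is hypothetical, not the authors' API.

    import statistics

    def rebalance(throughputs: dict[str, float], global_batch: int,
                  variance_threshold: float = 0.05) -> dict[str, int]:
        """Hypothetical sketch, not EP4DDL itself: split a global mini-batch
        across workers in proportion to measured throughput (samples/s),
        but only when the spread of the measurements signals a straggler."""
        mean = statistics.fmean(throughputs.values())
        # Variance normalized by the squared mean (a squared coefficient of
        # variation): scale-free, so one threshold suits fast and slow clusters.
        spread = statistics.pvariance(throughputs.values()) / (mean ** 2)
        if spread < variance_threshold:
            # Performance is uniform enough: keep an even split.
            even = global_batch // len(throughputs)
            return {worker: even for worker in throughputs}
        total = sum(throughputs.values())
        return {worker: max(1, round(global_batch * rate / total))
                for worker, rate in throughputs.items()}

    # Example: worker "c" is a straggler and receives a smaller local batch.
    print(rebalance({"a": 100.0, "b": 95.0, "c": 40.0}, global_batch=512))
    # -> {'a': 218, 'b': 207, 'c': 87}

The proportional split is just one simple policy consistent with the abstract's description; EP4DDL's flexible per-node parallelism, driven by its performance-variance metric, is detailed in the paper itself.
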
ISSN: 0920-8542, 1573-0484
DOI: 10.1007/s11227-022-04466-8