EP4DDL: addressing straggler problem in heterogeneous distributed deep learning
Published in | The Journal of Supercomputing, Vol. 78, No. 13, pp. 15663–15680
---|---
Main Authors | 
Format | Journal Article
Language | English
Published | New York: Springer US, 01.09.2022 (Springer Nature B.V.)
Summary | Driven by big data, neural networks have grown more complex, and the computing capacity of a single machine often cannot meet the demand. Distributed deep learning has shown great performance advantages in handling this problem. However, a serious issue in this field is the existence of stragglers, which significantly restricts the performance of the whole system. Fully exploiting the computing capacity of a system based on the parameter server architecture is an enormous challenge, especially in a heterogeneous environment. Motivated by this, we designed a method named EP4DDL to minimize the impact of the straggler problem through load balancing. Taking a statistical view, the approach introduces a novel metric named performance variance to give a comprehensive inspection of stragglers and employs flexible parallelism for each node. We verify the algorithm on standard benchmarks and demonstrate that it reduces training time by 57.46%, 24.8%, and 11.5% relative to FlexRR, Con-SGD, and Falcon, respectively, without accuracy loss.
ISSN | 0920-8542; 1573-0484
DOI | 10.1007/s11227-022-04466-8
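The record gives only the abstract, not EP4DDL's algorithm. As an illustration of the idea the abstract describes, here is a minimal Python sketch under two assumptions that the source does not confirm: that "performance variance" is the variance of per-worker iteration times, and that "flexible parallelism" means reassigning each worker's share of the global batch inversely to its measured iteration time. All function names and the inverse-proportional rule are hypothetical.

```python
import statistics


def performance_variance(iter_times):
    """Population variance of per-worker iteration times.

    A rough stand-in for the paper's "performance variance" metric:
    a large value suggests the cluster is heterogeneous and at least
    one worker is straggling.
    """
    return statistics.pvariance(iter_times)


def rebalance_batch_sizes(iter_times, total_batch):
    """Split a global batch across workers inversely to iteration time.

    Hypothetical load-balancing rule: faster workers (smaller iteration
    time) receive proportionally more samples, so all workers finish an
    iteration at roughly the same moment.
    """
    speeds = [1.0 / t for t in iter_times]
    total_speed = sum(speeds)
    shares = [round(total_batch * s / total_speed) for s in speeds]
    shares[0] += total_batch - sum(shares)  # absorb rounding drift
    return shares


if __name__ == "__main__":
    # Simulated per-iteration times (seconds) for four heterogeneous
    # workers; the last one is a straggler.
    times = [0.8, 0.9, 1.0, 2.4]
    print("performance variance:", performance_variance(times))
    print("rebalanced batch sizes:", rebalance_batch_sizes(times, 512))
```

Equalizing per-iteration finish times this way shortens the wait at the parameter server's synchronization barrier, which is where stragglers cost the most in synchronous training.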