A Communication Efficient ADMM-based Distributed Algorithm Using Two-Dimensional Torus Grouping AllReduce
| Published in | Data Science and Engineering, Vol. 8, No. 1, pp. 61–72 |
|---|---|
| Main Authors | , , , |
| Format | Journal Article |
| Language | English |
| Published | Singapore: Springer Nature Singapore, 01.03.2023 (Springer Nature B.V.; SpringerOpen) |
Summary: Large-scale distributed training consists mainly of sub-model parallel training and parameter synchronization. As the number of training workers grows, the efficiency of parameter synchronization degrades. To tackle this problem, we first propose 2D-TGA, a grouping AllReduce method based on the two-dimensional torus topology. This method synchronizes the model parameters by grouping and makes full use of the available bandwidth. Secondly, we propose a distributed algorithm, 2D-TGA-ADMM, which combines 2D-TGA with the alternating direction method of multipliers (ADMM). It focuses on sub-model training and reduces the wait time among workers during synchronization. Finally, experimental results on the Tianhe-2 supercomputing platform show that, compared with MPI_Allreduce, 2D-TGA shortens the synchronization wait time by 33%.
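To make the grouping idea concrete, below is a minimal sketch of a row-then-column AllReduce over an R × C process grid, written with mpi4py. The abstract does not specify 2D-TGA's exact grouping schedule, so the grid layout, the function name `torus_grouped_allreduce`, and the final averaging step are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: generic two-phase (row, then column) grouped AllReduce on a
# rows x cols grid. The actual 2D-TGA schedule may differ from this pattern.
import numpy as np
from mpi4py import MPI

def torus_grouped_allreduce(local: np.ndarray, rows: int, cols: int) -> np.ndarray:
    """Average `local` across a rows x cols process grid in two group phases."""
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    assert comm.Get_size() == rows * cols, "world size must equal rows * cols"

    r, c = divmod(rank, cols)                 # this worker's grid coordinates
    row_comm = comm.Split(color=r, key=c)     # group 1: workers sharing a row
    col_comm = comm.Split(color=c, key=r)     # group 2: workers sharing a column

    # Phase 1: all row groups reduce in parallel; each worker gets its row sum.
    row_sum = np.empty_like(local)
    row_comm.Allreduce(local, row_sum, op=MPI.SUM)

    # Phase 2: column groups reduce the row sums; every worker now holds the
    # global sum, having only ever communicated inside small groups.
    total = np.empty_like(local)
    col_comm.Allreduce(row_sum, total, op=MPI.SUM)

    row_comm.Free()
    col_comm.Free()
    return total / (rows * cols)              # synchronized (averaged) parameters
```

Splitting one global AllReduce over P workers into two group operations lets the rows (and then the columns) communicate in parallel over shorter paths, which is what allows a torus-style scheme to exploit the aggregate link bandwidth.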
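The combination with ADMM can be illustrated by a standard consensus-ADMM loop, sketched below for a hypothetical per-worker least-squares sub-problem (the paper's actual sub-model objective is not given in the abstract). The point of the sketch is that each worker trains its sub-model locally, and only the averaging step needs global communication; that single AllReduce per iteration is the synchronization point a grouped scheme like 2D-TGA is designed to speed up.

```python
# Hedged sketch: consensus ADMM with one least-squares sub-problem per worker,
# min_x sum_k (1/2)||A_k x - b_k||^2. The per-worker objective is assumed,
# not taken from the paper.
import numpy as np
from mpi4py import MPI

def consensus_admm(A: np.ndarray, b: np.ndarray, rho: float = 1.0, iters: int = 100) -> np.ndarray:
    comm = MPI.COMM_WORLD
    P = comm.Get_size()
    n = A.shape[1]
    x = np.zeros(n)                      # local primal variable
    z = np.zeros(n)                      # global consensus variable
    u = np.zeros(n)                      # scaled dual variable

    AtA = A.T @ A + rho * np.eye(n)      # local system for the x-update
    Atb = A.T @ b
    buf = np.empty(n)
    for _ in range(iters):
        # Local step: x = argmin (1/2)||A x - b||^2 + (rho/2)||x - z + u||^2.
        x = np.linalg.solve(AtA, Atb + rho * (z - u))
        # Consensus step: z = mean_k (x_k + u_k). This AllReduce is the only
        # global synchronization, so reducing its wait time speeds up training.
        comm.Allreduce(x + u, buf, op=MPI.SUM)
        z = buf / P
        u = u + x - z                    # dual update
    return z
```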
ISSN: 2364-1185; 2364-1541
DOI: 10.1007/s41019-022-00202-7