ACCL: Architecting Highly Scalable Distributed Training Systems With Highly Efficient Collective Communication Library

Bibliographic Details
Published in: IEEE Micro, Vol. 41, No. 5, pp. 85-92
Main Authors: Dong, Jianbo; Wang, Shaochuang; Feng, Fei; Cao, Zheng; Pan, Heng; Tang, Lingbo; Li, Pengcheng; Li, Hao; Ran, Qianyuan; Guo, Yiqun; Gao, Shanyuan; Long, Xin; Zhang, Jie; Li, Yong; Xia, Zhisheng; Song, Liuyihan; Zhang, Yingya; Pan, Pan; Wang, Guohui; Jiang, Xiaowei
Format: Journal Article
Language: English
Published: Los Alamitos: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.09.2021
Summary: Distributed systems have been widely adopted for training deep neural network models. However, the scalability of distributed training systems is largely bounded by communication cost. We design a highly efficient collective communication library, namely the Alibaba Collective Communication Library (ACCL), to build distributed training systems with linear scalability. ACCL provides optimized algorithms that make full use of heterogeneous interconnects simultaneously, and experimental results show significant performance improvements.
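
The abstract's central idea, driving heterogeneous interconnects (for example, an intra-node link and an inter-node NIC) at the same time during a collective, can be sketched as bandwidth-proportional payload splitting. The Python below is a hypothetical illustration only, not ACCL's API; the function name and bandwidth figures are assumptions.

    # Hypothetical sketch (not ACCL's API): split a collective's payload across
    # two heterogeneous links in proportion to their bandwidths, so both
    # interconnects carry traffic simultaneously instead of one sitting idle.
    def split_payload(num_bytes: int, bw_fast_gbps: float, bw_slow_gbps: float):
        """Return (fast_bytes, slow_bytes) proportional to link bandwidth."""
        total_bw = bw_fast_gbps + bw_slow_gbps
        fast_bytes = int(num_bytes * bw_fast_gbps / total_bw)
        return fast_bytes, num_bytes - fast_bytes

    # Example: a 1-GiB gradient buffer split between a 200-Gb/s intra-node link
    # and a 100-Gb/s RDMA NIC (illustrative numbers); each chunk would then be
    # reduced over its own link concurrently.
    fast, slow = split_payload(1 << 30, 200.0, 100.0)
    print(fast, slow)  # about two thirds of the bytes go over the faster link

Under ideal overlap, the transfer time drops from num_bytes / bw_fast to num_bytes / (bw_fast + bw_slow), which is the sense in which both interconnects are used simultaneously.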
ISSN: 0272-1732
EISSN: 1937-4143
DOI: 10.1109/MM.2021.3091475