ACCL: Architecting Highly Scalable Distributed Training Systems With Highly Efficient Collective Communication Library

Bibliographic Details
Published in: IEEE Micro, Vol. 41, No. 5, pp. 85-92
Main Authors: Dong, Jianbo; Wang, Shaochuang; Feng, Fei; Cao, Zheng; Pan, Heng; Tang, Lingbo; Li, Pengcheng; Li, Hao; Ran, Qianyuan; Guo, Yiqun; Gao, Shanyuan; Long, Xin; Zhang, Jie; Li, Yong; Xia, Zhisheng; Song, Liuyihan; Zhang, Yingya; Pan, Pan; Wang, Guohui; Jiang, Xiaowei
Format: Journal Article
Language: English
Published: Los Alamitos: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.09.2021
Summary: Distributed systems have been widely adopted for training deep neural network models. However, the scalability of distributed training systems is largely bounded by communication cost. We design a highly efficient collective communication library, namely the Alibaba Collective Communication Library (ACCL), to build distributed training systems with linear scalability. ACCL provides optimized algorithms that make full use of heterogeneous interconnects simultaneously, and experimental results show significant performance improvements.
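
The abstract's central idea, driving heterogeneous interconnects (for example, an intra-node link and an inter-node NIC) at the same time during a collective, can be sketched as bandwidth-proportional payload splitting. The Python below is a hypothetical illustration only, not ACCL's API; the function name and bandwidth figures are assumptions.

    # Hypothetical sketch (not ACCL's API): split a collective's payload across
    # two heterogeneous links in proportion to their bandwidths, so both
    # interconnects carry traffic simultaneously instead of one sitting idle.
    def split_payload(num_bytes: int, bw_fast_gbps: float, bw_slow_gbps: float):
        """Return (fast_bytes, slow_bytes) proportional to link bandwidth."""
        total_bw = bw_fast_gbps + bw_slow_gbps
        fast_bytes = int(num_bytes * bw_fast_gbps / total_bw)
        return fast_bytes, num_bytes - fast_bytes

    # Example: a 1-GiB gradient buffer split between a 200-Gb/s intra-node link
    # and a 100-Gb/s RDMA NIC (illustrative numbers); each chunk would then be
    # reduced over its own link concurrently.
    fast, slow = split_payload(1 << 30, 200.0, 100.0)
    print(fast, slow)  # about two thirds of the bytes go over the faster link

Under ideal overlap, the transfer time drops from num_bytes / bw_fast to num_bytes / (bw_fast + bw_slow), which is the sense in which both interconnects are used simultaneously.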
ISSN: 0272-1732
EISSN: 1937-4143
DOI: 10.1109/MM.2021.3091475