ACCL: Architecting Highly Scalable Distributed Training Systems With Highly Efficient Collective Communication Library
Published in: IEEE Micro, Vol. 41, No. 5, pp. 85-92
Format: Journal Article
Language: English
Published: Los Alamitos: IEEE (The Institute of Electrical and Electronics Engineers, Inc.), 1 September 2021
Summary: Distributed systems have been widely adopted for training deep neural network models. However, the scalability of distributed training systems is largely bounded by communication cost. We design a highly efficient collective communication library, the Alibaba Collective Communication Library (ACCL), to build distributed training systems with linear scalability. ACCL provides optimized algorithms that exploit heterogeneous interconnects simultaneously, and experimental results show significant performance improvements.
ISSN: 0272-1732 (print); 1937-4143 (electronic)
DOI: 10.1109/MM.2021.3091475
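This record does not describe ACCL's actual interface, so the following is only a minimal sketch of the core idea named in the summary: keeping heterogeneous interconnects busy simultaneously by splitting a collective's payload across links in proportion to their bandwidths. The function name, parameters, and bandwidth figures are illustrative assumptions, not part of ACCL.

```python
# Hypothetical sketch, NOT ACCL's API: split a collective's payload across
# heterogeneous links in proportion to bandwidth so all links finish together.

def split_payload(total_bytes: int, link_bandwidths_gbps: list[float]) -> list[int]:
    """Assign each link a chunk proportional to its share of total bandwidth."""
    total_bw = sum(link_bandwidths_gbps)
    chunks = [int(total_bytes * bw / total_bw) for bw in link_bandwidths_gbps]
    # Give any integer-rounding remainder to the fastest link.
    fastest = max(range(len(chunks)), key=lambda i: link_bandwidths_gbps[i])
    chunks[fastest] += total_bytes - sum(chunks)
    return chunks

# Example (assumed figures): a 400 MiB gradient allreduce over a 300 Gb/s
# intra-node link and a 100 Gb/s RDMA NIC.
chunks = split_payload(400 * 2**20, [300.0, 100.0])
print(chunks)  # ~[300 MiB, 100 MiB] in bytes
```

Proportional splitting lets every link finish at roughly the same time, so the effective bandwidth approaches the sum of the links rather than being serialized on the slowest one, which is one plausible way a library could pursue the linear scalability the summary claims.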