Speeding Up Distributed Machine Learning Using Codes

Bibliographic Details
Published in: IEEE Transactions on Information Theory, Vol. 64, No. 3, pp. 1514-1529
Main Authors: Lee, Kangwook; Lam, Maximilian; Pedarsani, Ramtin; Papailiopoulos, Dimitris; Ramchandran, Kannan
Format: Journal Article
Language: English
Published: New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.03.2018
More Information
Summary: Codes are widely used in many engineering applications to offer robustness against noise. In large-scale systems, there are several types of noise that can affect the performance of distributed machine learning algorithms (straggler nodes, system failures, or communication bottlenecks), but there has been little interaction cutting across codes, machine learning, and distributed systems. In this paper, we provide theoretical insights on how coded solutions can achieve significant gains compared with uncoded ones. We focus on two of the most basic building blocks of distributed learning algorithms: matrix multiplication and data shuffling. For matrix multiplication, we use codes to alleviate the effect of stragglers and show that if the number of homogeneous workers is $n$, and the runtime of each subtask has an exponential tail, coded computation can speed up distributed matrix multiplication by a factor of $\log n$. For data shuffling, we use codes to reduce communication bottlenecks, exploiting the excess in storage. We show that when a constant fraction $\alpha$ of the data matrix can be cached at each worker, and $n$ is the number of workers, coded shuffling reduces the communication cost by a factor of $\left(\alpha + \frac{1}{n}\right)\gamma(n)$ compared with uncoded shuffling, where $\gamma(n)$ is the ratio of the cost of unicasting $n$ messages to $n$ users to that of multicasting a common message (of the same size) to $n$ users. For instance, $\gamma(n) \simeq n$ if multicasting a message to $n$ users is as cheap as unicasting a message to one user. We also provide experimental results corroborating the theoretical gains of the coded algorithms.
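As an illustration of the coded-computation idea summarized above, the following is a minimal sketch, not the authors' implementation, of straggler-tolerant distributed matrix-vector multiplication using an $(n, k)$ MDS code over the row blocks of $A$: the master encodes $k$ blocks into $n$ coded blocks, each worker multiplies its coded block by $x$, and $Ax$ is decodable from the results of any $k$ workers, so the slowest $n - k$ workers can be ignored. The function names (encode_blocks, decode), the use of NumPy, and the real-valued Vandermonde generator are illustrative assumptions.

```python
import numpy as np

def encode_blocks(A, n, k):
    """Split A row-wise into k blocks and produce n coded blocks using a
    real-valued Vandermonde generator (distinct evaluation points), so any
    k of its rows form an invertible matrix, i.e., an MDS-like code."""
    blocks = np.split(A, k, axis=0)  # assumes A.shape[0] is divisible by k
    G = np.vander(np.arange(1, n + 1), k, increasing=True).astype(float)  # n x k
    coded = [sum(G[i, j] * blocks[j] for j in range(k)) for i in range(n)]
    return G, coded

def decode(G, finished, results, k):
    """Recover [A_1 x; ...; A_k x] from any k worker results.
    `finished` lists the indices of workers that returned, and `results`
    holds their coded products in the same order."""
    idx = finished[:k]
    Y = np.stack(results[:k])                 # k x (rows per block)
    B = np.linalg.solve(G[idx, :], Y)         # invert the k x k sub-generator
    return B.reshape(-1)                      # concatenate the decoded blocks

if __name__ == "__main__":
    n, k = 5, 3                               # tolerates n - k = 2 stragglers
    A = np.random.randn(6, 4)                 # 6 rows -> 3 blocks of 2 rows each
    x = np.random.randn(4)

    G, coded = encode_blocks(A, n, k)
    worker_out = [c @ x for c in coded]       # what each worker would send back

    finished = [4, 1, 2]                      # pretend workers 0 and 3 straggle
    y = decode(G, finished, [worker_out[i] for i in finished], k)
    assert np.allclose(y, A @ x)
```

For the shuffling result, the stated gain can be read off directly from the formula: with, say, $\alpha = 1/2$ and a multicast-friendly network where $\gamma(n) \simeq n$, coded shuffling reduces the communication cost by roughly a factor of $n/2 + 1$ relative to uncoded shuffling.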
ISSN: 0018-9448, 1557-9654
DOI: 10.1109/TIT.2017.2736066