Distributed Newton Methods for Deep Neural Networks
| Published in | Neural Computation, Vol. 30, No. 6, pp. 1673-1724 |
|---|---|
| Main Authors | |
| Format | Journal Article |
| Language | English |
| Published | MIT Press, One Rogers Street, Cambridge, MA 02142-1209, USA, 01.06.2018 |
| Summary | Deep learning involves a difficult nonconvex optimization problem with a large number of weights between any two adjacent layers of a deep structure. To handle large data sets or complicated networks, distributed training is needed, but the calculation of function, gradient, and Hessian is expensive. In particular, the communication and the synchronization cost may become a bottleneck. In this letter, we focus on situations where the model is distributedly stored and propose a novel distributed Newton method for training deep neural networks. By variable and feature-wise data partitions and some careful designs, we are able to explicitly use the Jacobian matrix for matrix-vector products in the Newton method. Some techniques are incorporated to reduce the running time as well as memory consumption. First, to reduce the communication cost, we propose a diagonalization method such that an approximate Newton direction can be obtained without communication between machines. Second, we consider subsampled Gauss-Newton matrices for reducing the running time as well as the communication cost. Third, to reduce the synchronization cost, we terminate the process of finding an approximate Newton direction even though some nodes have not finished their tasks. Details of some implementation issues in distributed environments are thoroughly investigated. Experiments demonstrate that the proposed method is effective for the distributed training of deep neural networks. Compared with stochastic gradient methods, it is more robust and may give better test accuracy. |
| Bibliography | June 2018 |
| ISSN | 0899-7667, 1530-888X |
| DOI | 10.1162/neco_a_01088 |
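
The summary above describes using the Jacobian only through matrix-vector products with a (subsampled) Gauss-Newton matrix and stopping the search for an approximate Newton direction early. The following is a minimal single-machine sketch of those two ingredients, not the authors' implementation: it assumes a toy linear least-squares model (so the Gauss-Newton matrix reduces to J^T J), and the function names, damping value, batch, and conjugate-gradient iteration cap are all illustrative.

```python
import jax
import jax.numpy as jnp

def model(w, X):
    """Toy one-layer model X @ w, standing in for a deep network."""
    return X @ w

def gnv(w, X_sub, v, damping=1e-3):
    """(G + damping*I) v with G = J^T J / |S| on the subsampled batch X_sub.

    The Jacobian J of the model outputs w.r.t. w is never formed explicitly:
    one forward-mode product gives J v, one reverse-mode product gives J^T (J v).
    """
    f = lambda p: model(p, X_sub)
    _, Jv = jax.jvp(f, (w,), (v,))      # forward-mode: J v
    _, vjp = jax.vjp(f, w)
    (JtJv,) = vjp(Jv)                   # reverse-mode: J^T (J v)
    return JtJv / X_sub.shape[0] + damping * v

def newton_direction(w, X_sub, grad, max_cg_iters=10, tol=1e-6):
    """Approximately solve (G + damping*I) d = -grad by truncated CG.

    The fixed iteration cap mimics terminating the search for the Newton
    direction before the linear system is solved to high accuracy.
    """
    d = jnp.zeros_like(w)
    r = -grad                           # residual of the linear system
    p = r
    rs = r @ r
    for _ in range(max_cg_iters):
        Gp = gnv(w, X_sub, p)
        alpha = rs / (p @ Gp)
        d = d + alpha * p
        r = r - alpha * Gp
        rs_new = r @ r
        if jnp.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return d
```

With a squared loss, the gradient passed to `newton_direction` would be J^T(model(w, X) - y)/|S|, and the returned direction would then be combined with a line search or a damping update. The sketch is purely single-machine; in the paper's setting the same products are additionally split across nodes according to the variable and feature-wise partitions, with the diagonalization and subsampling techniques used to cut communication and synchronization costs.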