Distributed Newton Methods for Deep Neural Networks



Bibliographic Details
Published in: Neural Computation, Vol. 30, No. 6, pp. 1673-1724
Main Authors: Wang, Chien-Chih; Tan, Kent Loong; Chen, Chun-Ting; Lin, Yu-Hsiang; Keerthi, S. Sathiya; Mahajan, Dhruv; Sundararajan, S.; Lin, Chih-Jen
Format: Journal Article
Language: English
Published: MIT Press, One Rogers Street, Cambridge, MA 02142-1209, USA, 01.06.2018

Summary: Deep learning involves a difficult nonconvex optimization problem with a large number of weights between any two adjacent layers of a deep structure. To handle large data sets or complicated networks, distributed training is needed, but the calculation of function, gradient, and Hessian is expensive. In particular, the communication and the synchronization cost may become a bottleneck. In this letter, we focus on situations where the model is distributedly stored and propose a novel distributed Newton method for training deep neural networks. By variable and feature-wise data partitions and some careful designs, we are able to explicitly use the Jacobian matrix for matrix-vector products in the Newton method. Some techniques are incorporated to reduce the running time as well as memory consumption. First, to reduce the communication cost, we propose a diagonalization method such that an approximate Newton direction can be obtained without communication between machines. Second, we consider subsampled Gauss-Newton matrices for reducing the running time as well as the communication cost. Third, to reduce the synchronization cost, we terminate the process of finding an approximate Newton direction even though some nodes have not finished their tasks. Details of some implementation issues in distributed environments are thoroughly investigated. Experiments demonstrate that the proposed method is effective for the distributed training of deep neural networks. Compared with stochastic gradient methods, it is more robust and may give better test accuracy.
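The core computation mentioned in the summary, Gauss-Newton matrix-vector products inside an iterative solver such as conjugate gradient, can be illustrated with a minimal single-machine sketch. This is not the authors' distributed implementation: it assumes a loss for which the subsampled Gauss-Newton matrix reduces to J^T J / |S| plus a damping term, with the Jacobian J of a subsample stored explicitly, and the names gn_matvec and conjugate_gradient are illustrative.

import numpy as np

def gn_matvec(J, v, lam, n_sub):
    # Compute (J^T J / n_sub + lam*I) v via two Jacobian-vector products;
    # the subsampled Gauss-Newton matrix itself is never formed.
    return J.T @ (J @ v) / n_sub + lam * v

def conjugate_gradient(matvec, b, tol=1e-6, max_iter=250):
    # Standard CG for the symmetric positive-definite system A d = b.
    d = np.zeros_like(b)
    r = b.copy()              # residual b - A d, with d = 0 initially
    p = r.copy()
    rs_old = r @ r
    b_norm = np.linalg.norm(b)
    for _ in range(max_iter):
        Ap = matvec(p)
        alpha = rs_old / (p @ Ap)
        d += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) <= tol * b_norm:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return d

# Toy usage with a random stand-in for the stacked per-example Jacobians of a
# subsample of n_sub examples, C network outputs, and p parameters.
rng = np.random.default_rng(0)
n_sub, C, p = 64, 10, 200
J = rng.standard_normal((n_sub * C, p))   # stacked per-example Jacobians
g = rng.standard_normal(p)                # gradient of the training objective
lam = 1e-2                                # damping / regularization term
newton_dir = conjugate_gradient(lambda v: gn_matvec(J, v, lam, n_sub), -g)

In the distributed setting described in the summary, the pieces of J and of these products would be partitioned across machines, with the proposed diagonalization and early-termination techniques used to reduce the communication and synchronization costs that such products would otherwise incur.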
ISSN: 0899-7667, 1530-888X
DOI: 10.1162/neco_a_01088