A Performance Improvement Approach for Second-Order Optimization in Large Mini-batch Training

Bibliographic Details
Published in: 2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), pp. 696 - 703
Main Authors: Naganuma, Hiroki; Yokota, Rio
Format: Conference Proceeding
Language: English
Published: IEEE, 01.05.2019

Summary: Classical learning theory states that when the number of parameters of a model is too large relative to the data, the model will overfit and generalization performance will deteriorate. However, it has been shown empirically that deep neural networks (DNNs) can achieve high generalization capability by training with extremely large amounts of data and model parameters, which exceeds the predictions of classical learning theory. One drawback is that training DNNs requires enormous computation time; therefore, the training time must be reduced through large-scale parallelization. Straightforward data-parallelization of DNN training, however, degrades convergence and generalization. In the present work, we investigate the possibility of using second-order methods to close this generalization gap in large-batch training. This is motivated by our observation that each mini-batch becomes more statistically stable as the batch size grows, so the effect of considering the curvature plays a more important role in large-batch training. We also found that naively adapting the natural gradient method causes the generalization performance to deteriorate further due to its lack of regularization capability. We propose an improved second-order method that smooths the loss function, which allows second-order methods to generalize as well as mini-batch SGD.
DOI: 10.1109/CCGRID.2019.00092
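
For context, the natural gradient method referenced in the summary preconditions the gradient with the inverse Fisher information matrix. A generic form of the update (a standard textbook formulation, not necessarily the exact variant used in the paper) is

\[
\theta_{t+1} = \theta_t - \eta\, F(\theta_t)^{-1} \nabla_\theta L(\theta_t),
\qquad
F(\theta) = \mathbb{E}_{x,\, y \sim p_\theta}\!\left[ \nabla_\theta \log p_\theta(y \mid x)\, \nabla_\theta \log p_\theta(y \mid x)^{\top} \right],
\]

where \(\eta\) is the learning rate and \(L\) the training loss. One common way to "smooth" a loss function, offered here only as an illustrative assumption rather than the paper's specific construction, is to average it over small parameter perturbations,

\[
\tilde{L}(\theta) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, \sigma^2 I)}\!\left[ L(\theta + \epsilon) \right],
\]

with a hypothetical smoothing scale \(\sigma\); the paper's actual smoothing scheme may differ and is described in the full text.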