SGD Distributional Dynamics of Three Layer Neural Networks
Published in | arXiv.org
---|---
Format | Paper
Language | English
Published | Ithaca: Cornell University Library, arXiv.org, 30.12.2020
Summary: With the rise of big data analytics, multi-layer neural networks have surfaced as one of the most powerful machine learning methods. However, their theoretical mathematical properties are still not fully understood. Training a neural network requires optimizing a non-convex objective function, typically done using stochastic gradient descent (SGD). In this paper, we seek to extend the mean field results of Mei et al. (2018) from two-layer neural networks with one hidden layer to three-layer neural networks with two hidden layers. We will show that the SGD dynamics is captured by a set of non-linear partial differential equations, and prove that the distributions of weights in the two hidden layers are independent. We will also detail exploratory work done based on simulation and real-world data.
ISSN: 2331-8422
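
To make the setting in the summary concrete, here is a minimal sketch of one-sample SGD on a network with two hidden layers, i.e. the three-layer architecture the paper studies. This is not the authors' code: the layer widths, tanh activation, learning rate, and synthetic teacher target are all illustrative assumptions. In the mean-field regime the paper analyzes, one tracks the empirical distributions of the rows of the two hidden-layer weight matrices as the widths grow large.

```python
# Minimal sketch (illustrative, not the paper's code): plain one-sample SGD
# on a three-layer network (two hidden layers) with squared loss.
import numpy as np

rng = np.random.default_rng(0)

d, n1, n2 = 10, 100, 100                       # input dim, two hidden widths (assumed)
W1 = rng.normal(0, 1 / np.sqrt(d),  (n1, d))   # first hidden-layer weights
W2 = rng.normal(0, 1 / np.sqrt(n1), (n2, n1))  # second hidden-layer weights
a  = rng.normal(0, 1 / np.sqrt(n2), n2)        # output weights

def sigma(z):                  # activation (tanh chosen for illustration)
    return np.tanh(z)

def dsigma(z):                 # derivative of tanh, used in backpropagation
    return 1.0 - np.tanh(z) ** 2

lr = 0.05
for step in range(10_000):
    # One-sample SGD: draw a fresh example (x, y) from a synthetic teacher.
    x = rng.normal(size=d)
    y = np.sin(x[0])                     # illustrative target function

    # Forward pass through the two hidden layers.
    z1 = W1 @ x;  h1 = sigma(z1)
    z2 = W2 @ h1; h2 = sigma(z2)
    yhat = a @ h2

    # Backward pass for the squared loss (yhat - y)^2 / 2.
    err = yhat - y
    g2 = err * a * dsigma(z2)            # gradient signal at hidden layer 2
    g1 = (W2.T @ g2) * dsigma(z1)        # gradient signal at hidden layer 1

    # SGD updates of all three weight groups.
    a  -= lr * err * h2
    W2 -= lr * np.outer(g2, h1)
    W1 -= lr * np.outer(g1, x)
```

The mean-field viewpoint referenced in the summary treats the rows of W1 and W2 as particles; the claimed result is that the SGD evolution of their empirical distributions is described by non-linear partial differential equations, with the two hidden layers' weight distributions remaining independent.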