SGD Distributional Dynamics of Three Layer Neural Networks
Main Authors | , , |
---|---|
Format | Journal Article |
Language | English |
Published | 29.12.2020 |
Subjects | |
Summary: With the rise of big data analytics, multi-layer neural networks have surfaced as one of the most powerful machine learning methods. However, their theoretical mathematical properties are still not fully understood. Training a neural network requires optimizing a non-convex objective function, typically done using stochastic gradient descent (SGD). In this paper, we seek to extend the mean field results of Mei et al. (2018) from two-layer neural networks with one hidden layer to three-layer neural networks with two hidden layers. We will show that the SGD dynamics is captured by a set of non-linear partial differential equations, and prove that the distributions of weights in the two hidden layers are independent. We will also detail exploratory work done based on simulation and real-world data.
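For context, the two-layer mean field result being extended describes the evolution of the empirical distribution of hidden-unit weights. As stated in Mei et al. (2018), the SGD dynamics of a wide two-layer network is captured, in the large-width, small-step-size limit, by a non-linear PDE of the form

$$
\partial_t \rho_t = 2\xi(t)\, \nabla_\theta \cdot \big( \rho_t \, \nabla_\theta \Psi(\theta; \rho_t) \big),
\qquad
\Psi(\theta; \rho) = V(\theta) + \int U(\theta, \tilde{\theta})\, \rho(\mathrm{d}\tilde{\theta}),
$$

where $V(\theta) = -\mathbb{E}[\, y\, \sigma_*(x; \theta) \,]$, $U(\theta_1, \theta_2) = \mathbb{E}[\, \sigma_*(x; \theta_1)\, \sigma_*(x; \theta_2) \,]$, and $\xi(t)$ is a step-size scaling. The paper abstracted here extends this type of distributional description to networks with two hidden layers.

To make the three-layer setting concrete, the following is a minimal sketch of the object of study: a network with two hidden layers trained by online SGD on squared loss. The widths, the mean-field-style 1/N output averaging, the step size, and the synthetic teacher function are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: input dimension d, hidden widths N1 and N2 (assumptions).
d, N1, N2 = 10, 200, 200

W1 = rng.normal(size=(N1, d))   # first-hidden-layer weights
W2 = rng.normal(size=(N2, N1))  # second-hidden-layer weights
a = rng.normal(size=N2)         # output weights

def teacher(x):
    return np.sin(x[0])  # hypothetical target function, for illustration only

lr = 0.1  # illustrative step size
for step in range(50_000):
    # one fresh sample per step: online SGD
    x = rng.normal(size=d)
    y = teacher(x)

    # forward pass, averaging over hidden units in mean-field style
    z1 = W1 @ x
    h1 = np.tanh(z1)
    z2 = (W2 @ h1) / N1
    h2 = np.tanh(z2)
    y_hat = (a @ h2) / N2

    # backward pass for the squared loss 0.5 * (y_hat - y)**2
    err = y_hat - y
    g_a = err * h2 / N2
    g_z2 = err * (a / N2) * (1.0 - h2**2)
    g_W2 = np.outer(g_z2, h1) / N1
    g_h1 = (W2.T @ g_z2) / N1
    g_z1 = g_h1 * (1.0 - h1**2)
    g_W1 = np.outer(g_z1, x)

    # plain SGD updates on all three weight groups
    a -= lr * g_a
    W2 -= lr * g_W2
    W1 -= lr * g_W1
```

Under the mean field view, one tracks the empirical distributions of the rows of W1 and W2 rather than individual weights; per the abstract, the paper proves that in the limit these two distributions evolve independently.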
DOI: 10.48550/arxiv.2012.15036