Can Shallow Neural Networks Beat the Curse of Dimensionality? A Mean Field Training Perspective
Published in: IEEE Transactions on Artificial Intelligence, Vol. 1, No. 2, pp. 121-129
Format: Journal Article
Language: English
Published: IEEE, 01.10.2020
ISSN: 2691-4581
DOI: 10.1109/TAI.2021.3051357
Summary: We prove that the gradient descent training of a two-layer neural network on empirical or population risk may not decrease population risk at an order faster than $t^{-4/(d-2)}$ under mean field scaling. The loss functional is mean squared error with a Lipschitz-continuous target function and data distributed uniformly on the $d$-dimensional unit cube. Thus gradient descent training for fitting reasonably smooth, but truly high-dimensional data may be subject to the curse of dimensionality. We present numerical evidence that gradient descent training with general Lipschitz target functions becomes slower and slower as the dimension increases, but converges at approximately the same rate in all dimensions when the target function lies in the natural function space for two-layer ReLU networks.
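As a concrete illustration of the training setup in the summary (not code from the article), the sketch below builds a two-layer ReLU network under mean field scaling, i.e. with the hidden-unit contributions averaged rather than summed, and runs full-batch gradient descent on the mean squared error against a Lipschitz target, with data drawn uniformly from the $d$-dimensional unit cube. The specific target function, width, sample size, and step size are illustrative assumptions.

```python
# Minimal sketch (illustrative assumptions, not the article's code):
# a two-layer ReLU network under mean field scaling,
#   f(x) = (1/m) * sum_k a_k * relu(w_k . x + b_k),
# trained by full-batch gradient descent on the mean squared error,
# with data sampled uniformly from the d-dimensional unit cube.
import numpy as np

rng = np.random.default_rng(0)

d, m, n = 16, 512, 2048                  # input dimension, width, sample size (assumed values)
X = rng.uniform(0.0, 1.0, size=(n, d))   # data uniform on [0, 1]^d
y = np.linalg.norm(X - 0.5, axis=1)      # a Lipschitz target function (illustrative choice)

# Network parameters, standard Gaussian initialization (an assumption)
W = rng.normal(size=(m, d))
b = rng.normal(size=m)
a = rng.normal(size=m)

def forward(X):
    pre = X @ W.T + b                    # (n, m) pre-activations
    act = np.maximum(pre, 0.0)           # ReLU
    out = act @ a / m                    # mean field scaling: average over the m units
    return out, pre, act

lr, steps = 0.05 * m, 2000               # step size scaled with the width m (illustrative;
                                         # mean field training is typically run on a time
                                         # scale proportional to the width)
for t in range(steps):
    out, pre, act = forward(X)
    resid = out - y                      # empirical risk = mean(resid**2)
    g_out = 2.0 * resid / n              # d(risk)/d(out)
    g_a = act.T @ g_out / m
    g_pre = np.outer(g_out, a / m) * (pre > 0)   # backprop through the ReLU
    g_W = g_pre.T @ X
    g_b = g_pre.sum(axis=0)
    a -= lr * g_a
    W -= lr * g_W
    b -= lr * g_b
    if t % 500 == 0:
        print(f"step {t:5d}  empirical risk {np.mean(resid ** 2):.6f}")
```

The $1/m$ factor in the network output is what distinguishes the mean field parameterization from the $1/\sqrt{m}$ scaling of the neural tangent kernel regime.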
Impact Statement: Artificial neural networks perform well in many real-life applications, but may suffer from the curse of dimensionality on certain problems. We provide theoretical and numerical evidence that this may be related to whether a target function lies in the hypothesis class described by infinitely wide networks. The training dynamics are considered in the fully non-linear regime and are not reduced to neural tangent kernels. We believe that it will be essential to study these hypothesis classes in detail in order to choose an appropriate machine learning model for a given problem. The goal of the article is to illustrate this in a mathematically sound and numerically convincing fashion.
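The "natural function space" and "hypothesis class described by infinitely wide networks" referred to above can be made concrete along the following lines; this formalization, often called Barron space in the related literature, is an assumed sketch rather than a quotation from the article, and exact norm conventions vary between works.

```latex
% Assumed sketch of the function space of infinitely wide two-layer ReLU networks:
% f is represented by a probability measure \pi over the parameters (a, w, b) of a
% single unit, and its norm measures the cheapest such representation.
\[
  f(x) = \mathbb{E}_{(a,w,b)\sim\pi}\bigl[\, a \,\max(w \cdot x + b,\, 0) \,\bigr],
  \qquad
  \|f\| = \inf_{\pi}\, \mathbb{E}_{(a,w,b)\sim\pi}\bigl[\, |a|\,\bigl(\|w\| + |b|\bigr) \,\bigr],
\]
% where the infimum runs over all representing measures \pi. Targets of this form, with
% finite norm, are the ones for which the summary reports convergence at approximately
% the same rate in all dimensions.
```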