Can Shallow Neural Networks Beat the Curse of Dimensionality? A Mean Field Training Perspective
Published in: IEEE Transactions on Artificial Intelligence, Vol. 1, No. 2, pp. 121-129
Format: Journal Article
Language: English
Published: IEEE, 01.10.2020
ISSN: 2691-4581
DOI: 10.1109/TAI.2021.3051357
Summary: We prove that the gradient descent training of a two-layer neural network on empirical or population risk may not decrease population risk at an order faster than $t^{-4/(d-2)}$ under mean field scaling. The loss functional is mean squared error with a Lipschitz-continuous target function and data distributed uniformly on the $d$-dimensional unit cube. Thus gradient descent training for fitting reasonably smooth, but truly high-dimensional data may be subject to the curse of dimensionality. We present numerical evidence that gradient descent training with general Lipschitz target functions becomes slower and slower as the dimension increases, but converges at approximately the same rate in all dimensions when the target function lies in the natural function space for two-layer ReLU networks.
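As a concrete illustration of the training setup in the summary (not code from the article), the sketch below builds a two-layer ReLU network under mean field scaling, i.e. with the hidden-unit contributions averaged rather than summed, and runs full-batch gradient descent on the mean squared error against a Lipschitz target, with data drawn uniformly from the $d$-dimensional unit cube. The specific target function, width, sample size, and step size are illustrative assumptions.

```python
# Minimal sketch (illustrative assumptions, not the article's code):
# a two-layer ReLU network under mean field scaling,
#   f(x) = (1/m) * sum_k a_k * relu(w_k . x + b_k),
# trained by full-batch gradient descent on the mean squared error,
# with data sampled uniformly from the d-dimensional unit cube.
import numpy as np

rng = np.random.default_rng(0)

d, m, n = 16, 512, 2048                  # input dimension, width, sample size (assumed values)
X = rng.uniform(0.0, 1.0, size=(n, d))   # data uniform on [0, 1]^d
y = np.linalg.norm(X - 0.5, axis=1)      # a Lipschitz target function (illustrative choice)

# Network parameters, standard Gaussian initialization (an assumption)
W = rng.normal(size=(m, d))
b = rng.normal(size=m)
a = rng.normal(size=m)

def forward(X):
    pre = X @ W.T + b                    # (n, m) pre-activations
    act = np.maximum(pre, 0.0)           # ReLU
    out = act @ a / m                    # mean field scaling: average over the m units
    return out, pre, act

lr, steps = 0.05 * m, 2000               # step size scaled with the width m (illustrative;
                                         # mean field training is typically run on a time
                                         # scale proportional to the width)
for t in range(steps):
    out, pre, act = forward(X)
    resid = out - y                      # empirical risk = mean(resid**2)
    g_out = 2.0 * resid / n              # d(risk)/d(out)
    g_a = act.T @ g_out / m
    g_pre = np.outer(g_out, a / m) * (pre > 0)   # backprop through the ReLU
    g_W = g_pre.T @ X
    g_b = g_pre.sum(axis=0)
    a -= lr * g_a
    W -= lr * g_W
    b -= lr * g_b
    if t % 500 == 0:
        print(f"step {t:5d}  empirical risk {np.mean(resid ** 2):.6f}")
```

The $1/m$ factor in the network output is what distinguishes the mean field parameterization from the $1/\sqrt{m}$ scaling of the neural tangent kernel regime.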
Impact Statement: Artificial neural networks perform well in many real-life applications, but may suffer from the curse of dimensionality on certain problems. We provide theoretical and numerical evidence that this may be related to whether a target function lies in the hypothesis class described by infinitely wide networks. The training dynamics are considered in the fully non-linear regime and are not reduced to neural tangent kernels. We believe that it will be essential to study these hypothesis classes in detail in order to choose an appropriate machine learning model for a given problem. The goal of the article is to illustrate this in a mathematically sound and numerically convincing fashion.
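The "natural function space" and "hypothesis class described by infinitely wide networks" referred to above can be made concrete along the following lines; this formalization, often called Barron space in the related literature, is an assumed sketch rather than a quotation from the article, and exact norm conventions vary between works.

```latex
% Assumed sketch of the function space of infinitely wide two-layer ReLU networks:
% f is represented by a probability measure \pi over the parameters (a, w, b) of a
% single unit, and its norm measures the cheapest such representation.
\[
  f(x) = \mathbb{E}_{(a,w,b)\sim\pi}\bigl[\, a \,\max(w \cdot x + b,\, 0) \,\bigr],
  \qquad
  \|f\| = \inf_{\pi}\, \mathbb{E}_{(a,w,b)\sim\pi}\bigl[\, |a|\,\bigl(\|w\| + |b|\bigr) \,\bigr],
\]
% where the infimum runs over all representing measures \pi. Targets of this form, with
% finite norm, are the ones for which the summary reports convergence at approximately
% the same rate in all dimensions.
```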