Can Shallow Neural Networks Beat the Curse of Dimensionality? A Mean Field Training Perspective

Bibliographic Details
Published in: IEEE Transactions on Artificial Intelligence, Vol. 1, No. 2, pp. 121-129
Main Authors: Wojtowytsch, Stephan; E, Weinan
Format: Journal Article
Language: English
Published: IEEE, 01.10.2020
ISSN: 2691-4581
DOI: 10.1109/TAI.2021.3051357

More Information
Summary: We prove that gradient descent training of a two-layer neural network on the empirical or population risk may not decrease the population risk at a rate faster than $t^{-4/(d-2)}$ under mean field scaling. The loss functional is the mean squared error with a Lipschitz-continuous target function and data distributed uniformly on the $d$-dimensional unit cube. Thus gradient descent training for fitting reasonably smooth, but truly high-dimensional, data may be subject to the curse of dimensionality. We present numerical evidence that gradient descent training with general Lipschitz target functions becomes slower and slower as the dimension increases, but converges at approximately the same rate in all dimensions when the target function lies in the natural function space for two-layer ReLU networks.

Impact Statement: Artificial neural networks perform well in many real-life applications, but may suffer from the curse of dimensionality on certain problems. We provide theoretical and numerical evidence that this may be related to whether a target function lies in the hypothesis class described by infinitely wide networks. The training dynamics are considered in the fully nonlinear regime and not reduced to neural tangent kernels. We believe that it will be essential to study these hypothesis classes in detail in order to choose an appropriate machine learning model for a given problem. The goal of the article is to illustrate this in a mathematically sound and numerically convincing fashion.
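The following is a minimal, illustrative sketch (not the authors' code) of the training setup described in the summary: a two-layer ReLU network under mean field scaling, trained by plain gradient descent on the mean squared error against a Lipschitz target, with data sampled uniformly from the d-dimensional unit cube. The width, learning rate, target function, and all other hyperparameters are assumptions chosen only for illustration.

# Sketch of mean-field gradient descent for a two-layer ReLU network,
#   f(x) = (1/m) * sum_k a_k * relu(w_k . x + b_k),
# trained on mean squared error against a Lipschitz target on [0, 1]^d.
# All names and hyperparameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

d, m = 10, 512              # input dimension, network width
steps, batch = 2000, 1024
lr = 0.5 * m                # step size scaled with the width so parameter updates
                            # stay O(1) under the 1/m output scaling (illustrative choice)

def target(x):
    # a simple Lipschitz target on the unit cube (illustrative choice)
    return np.linalg.norm(x - 0.5, axis=1)

# parameters of the two-layer ReLU network
a = rng.normal(size=m)
W = rng.normal(size=(m, d)) / np.sqrt(d)
b = rng.normal(size=m)

def forward(x):
    pre = x @ W.T + b                  # (batch, m) pre-activations
    act = np.maximum(pre, 0.0)         # ReLU
    return act @ a / m, pre, act       # mean field scaling: average over units

for t in range(steps):
    x = rng.uniform(0.0, 1.0, size=(batch, d))   # data uniform on the unit cube
    y = target(x)
    pred, pre, act = forward(x)
    err = pred - y                                # residual of the MSE loss

    # gradients of 0.5 * mean(err^2) with respect to a, W, b under the 1/m scaling
    grad_a = act.T @ err / (batch * m)
    gate = (pre > 0).astype(float) * a            # ReLU derivative times outer weight
    grad_W = (gate * err[:, None]).T @ x / (batch * m)
    grad_b = (gate * err[:, None]).sum(axis=0) / (batch * m)

    a -= lr * grad_a
    W -= lr * grad_W
    b -= lr * grad_b

    if t % 500 == 0:
        print(f"step {t:5d}  risk {0.5 * np.mean(err**2):.4f}")

According to the summary's lower bound, for general Lipschitz targets the population risk along such a trajectory cannot decay faster than $t^{-4/(d-2)}$; the printed risk values from a sketch of this kind can be used to probe how the observed decay changes as d varies.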