Disentangling feature and lazy training in deep neural networks

Bibliographic Details
Published in: Journal of statistical mechanics, Vol. 2020; no. 11; pp. 113301–113327
Main Authors: Geiger, Mario; Spigler, Stefano; Jacot, Arthur; Wyart, Matthieu
Format: Journal Article
Language: English
Published: IOP Publishing and SISSA, 01.11.2020

Summary: Two distinct limits for deep learning have been derived as the network width h → ∞, depending on how the weights of the last layer scale with h. In the neural tangent kernel (NTK) limit, the dynamics becomes linear in the weights and is described by a frozen kernel Θ (the NTK). By contrast, in the mean-field limit, the dynamics can be expressed in terms of the distribution of the parameters associated with a neuron, which follows a partial differential equation. In this work we consider deep networks whose last-layer weights scale as αh^{−1/2} at initialization. By varying α and h, we probe the crossover between the two limits. We observe the two previously identified regimes of 'lazy training' and 'feature training'. In the lazy-training regime, the dynamics is almost linear and the NTK barely changes after initialization. The feature-training regime includes the mean-field formulation as a limiting case and is characterized by a kernel that evolves in time, and thus learns some features. We perform numerical experiments on MNIST, Fashion-MNIST, EMNIST and CIFAR10 and consider various architectures. We find that: (i) the two regimes are separated by an α* that scales as 1/√h; (ii) network architecture and data structure play an important role in determining which regime is better: in our tests, fully-connected networks perform generally better in the lazy-training regime, unlike convolutional networks; (iii) in both regimes, the fluctuations δF induced on the learned function by the initial conditions decay as δF ∼ 1/√h, leading to a performance that increases with h. The same improvement can also be obtained at an intermediate width by ensemble-averaging several networks that are trained independently; (iv) in the feature-training regime we identify a time scale t₁ ∼ √h α, such that for t ≪ t₁ the dynamics is linear. At t ∼ t₁, the output has grown by a magnitude √h and the changes of the tangent kernel ||ΔΘ|| become significant. Ultimately, it follows ||ΔΘ|| ∼ (√h α)^{−a} for ReLU and Softplus activation functions, with a < 2 and a → 2 as depth grows. We provide scaling arguments supporting these findings.
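The following is a minimal, self-contained sketch (not the authors' code) of the setup described in the summary: a one-hidden-layer ReLU network whose last-layer contribution is scaled by α/√h, trained by plain gradient descent on toy data, with the relative change of the empirical tangent kernel measured at the end of training. The toy data, the width, the values of α, and the 1/α² rescaling of the learning rate are illustrative assumptions, not the paper's experimental settings; the intent is only to show how small α (feature training) moves the kernel while large α (lazy training) barely does.

```python
# Minimal sketch (assumed setup, not the paper's code): probe how much the
# empirical tangent kernel moves during training as a function of alpha,
# for a one-hidden-layer ReLU network with last layer scaled by alpha / sqrt(h).
import jax
import jax.numpy as jnp


def init_params(key, d, h):
    k1, k2 = jax.random.split(key)
    return {"W": jax.random.normal(k1, (h, d)),   # hidden weights, O(1) entries
            "a": jax.random.normal(k2, (h,))}     # last-layer weights, O(1) entries


def f(params, x, alpha, h):
    # network output; the last layer effectively carries weights of size alpha / sqrt(h)
    return alpha / jnp.sqrt(h) * params["a"] @ jax.nn.relu(params["W"] @ x)


def empirical_ntk(params, xs, alpha, h):
    # Theta(x, x') = grad_w f(x) . grad_w f(x'), gradients flattened over all parameters
    grads = jax.vmap(lambda x: jax.grad(f)(params, x, alpha, h))(xs)
    flat = jnp.concatenate(
        [g.reshape(xs.shape[0], -1) for g in jax.tree_util.tree_leaves(grads)], axis=1)
    return flat @ flat.T


def loss(params, xs, ys, alpha, h):
    preds = jax.vmap(lambda x: f(params, x, alpha, h))(xs)
    return jnp.mean((preds - ys) ** 2)


def relative_kernel_change(alpha, h, d=10, n=32, steps=300, base_lr=0.01, seed=0):
    key_x, key_p = jax.random.split(jax.random.PRNGKey(seed))
    xs = jax.random.normal(key_x, (n, d))   # toy inputs (illustrative)
    ys = jnp.sign(xs[:, 0])                 # toy binary labels (illustrative)
    params = init_params(key_p, d, h)
    theta0 = empirical_ntk(params, xs, alpha, h)

    # Rescale the learning rate by 1/alpha**2 so that the amount of training is
    # roughly comparable across the lazy (large alpha) and feature (small alpha) regimes.
    lr = base_lr / alpha ** 2
    grad_fn = jax.jit(jax.grad(loss))
    for _ in range(steps):
        g = grad_fn(params, xs, ys, alpha, h)
        params = jax.tree_util.tree_map(lambda p, gp: p - lr * gp, params, g)

    theta = empirical_ntk(params, xs, alpha, h)
    return float(jnp.linalg.norm(theta - theta0) / jnp.linalg.norm(theta0))


if __name__ == "__main__":
    h = 1024
    for alpha in (0.01, 1.0, 100.0):
        change = relative_kernel_change(alpha, h)
        print(f"alpha = {alpha:7.2f}:  ||dTheta|| / ||Theta0|| = {change:.3e}")
```

With these (arbitrary) settings one would expect the relative kernel change to drop by orders of magnitude as α grows, in line with the lazy/feature picture above; the quantitative scaling exponents reported in the paper of course require the full experiments.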
Bibliography: JSTAT_003P_0620
ISSN: 1742-5468
DOI: 10.1088/1742-5468/abc4de