Disentangling feature and lazy training in deep neural networks

Bibliographic Details
Published in: Journal of Statistical Mechanics, Vol. 2020, No. 11, pp. 113301–113327
Main Authors: Geiger, Mario; Spigler, Stefano; Jacot, Arthur; Wyart, Matthieu
Format: Journal Article
Language: English
Published: IOP Publishing and SISSA, 01.11.2020
Subjects: deep learning; machine learning
Online Access: https://iopscience.iop.org/article/10.1088/1742-5468/abc4de

Abstract Two distinct limits for deep learning have been derived as the network width h → ∞, depending on how the weights of the last layer scale with h. In the neural tangent kernel (NTK) limit, the dynamics becomes linear in the weights and is described by a frozen kernel Θ (the NTK). By contrast, in the mean-field limit, the dynamics can be expressed in terms of the distribution of the parameters associated with a neuron, which follows a partial differential equation. In this work we consider deep networks where the weights in the last layer scale as αh^{−1/2} at initialization. By varying α and h, we probe the crossover between the two limits. We observe the two previously identified regimes of 'lazy training' and 'feature training'. In the lazy-training regime, the dynamics is almost linear and the NTK barely changes after initialization. The feature-training regime includes the mean-field formulation as a limiting case and is characterized by a kernel that evolves in time, and thus learns some features. We perform numerical experiments on MNIST, Fashion-MNIST, EMNIST and CIFAR10 and consider various architectures. We find that: (i) the two regimes are separated by an α* that scales as 1/√h. (ii) Network architecture and data structure play an important role in determining which regime is better: in our tests, fully-connected networks perform generally better in the lazy-training regime, unlike convolutional networks. (iii) In both regimes, the fluctuations δF induced on the learned function by initial conditions decay as δF ∼ 1/√h, leading to a performance that increases with h. The same improvement can also be obtained at an intermediate width by ensemble-averaging several networks that are trained independently. (iv) In the feature-training regime we identify a time scale t1 ∼ √h α, such that for t ≪ t1 the dynamics is linear. At t ∼ t1, the output has grown by a magnitude √h and the changes of the tangent kernel ‖ΔΘ‖ become significant. Ultimately, it follows ‖ΔΘ‖ ∼ (√h α)^{−a} for ReLU and Softplus activation functions, with a < 2 and a → 2 as depth grows. We provide scaling arguments supporting these findings.
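For readers who want a concrete handle on the quantities in the abstract, the following Python sketch (an illustration only, not the authors' code: the one-hidden-layer network, quadratic loss, toy task and hyperparameters are all assumptions made for the sake of a short runnable example) shows how the last-layer initialization scale αh^{−1/2} enters a minimal model, and how the evolution of the empirical tangent kernel Θ can be monitored by comparing it before and after training. In the paper, the relative change ‖ΔΘ‖ is the quantity that separates lazy training (kernel essentially frozen) from feature training (kernel evolves), with a crossover at α* ∼ 1/√h established on real datasets and deep architectures.

import numpy as np

# Minimal sketch: one-hidden-layer ReLU network whose readout weights are
# initialized at scale alpha * h^{-1/2}, trained by full-batch gradient descent
# on a toy regression problem (all choices below are illustrative assumptions).
rng = np.random.default_rng(0)

def init_params(d, h, alpha):
    a = rng.standard_normal((h, d)) / np.sqrt(d)      # hidden weights, O(1) preactivations
    w = alpha * rng.standard_normal(h) / np.sqrt(h)   # readout weights ~ alpha * h^{-1/2}
    return a, w

def forward(a, w, X):
    z = X @ a.T                       # (n, h) preactivations
    phi = np.maximum(z, 0.0)          # ReLU features
    return phi @ w, phi, z

def tangent_kernel(a, w, X):
    # Empirical NTK: Theta(x, x') = sum over parameters of df(x)/dp * df(x')/dp
    _, phi, z = forward(a, w, X)
    b = (z > 0).astype(float) * w     # df/da_i = w_i * 1[z_i > 0] * x
    return phi @ phi.T + (b @ b.T) * (X @ X.T)

def kernel_change(alpha, h=512, d=5, n=64, steps=2000, lr_scale=0.1):
    X = rng.standard_normal((n, d)) / np.sqrt(d)
    y = np.sign(X[:, 0])              # toy binary labels
    a, w = init_params(d, h, alpha)
    theta0 = tangent_kernel(a, w, X)
    lr = lr_scale / h                 # kernel entries grow with h, so shrink the step
    for _ in range(steps):
        f, phi, z = forward(a, w, X)
        err = (f - y) / n             # gradient of (1/2n) sum (f - y)^2 with respect to f
        w -= lr * (phi.T @ err)
        a -= lr * (w[:, None] * (((z > 0) * err[:, None]).T @ X))
    dtheta = tangent_kernel(a, w, X) - theta0
    return np.linalg.norm(dtheta) / np.linalg.norm(theta0)

for alpha in (0.01, 1.0):
    print(f"alpha = {alpha}: ||Delta Theta|| / ||Theta_0|| = {kernel_change(alpha):.2e}")

Running kernel_change for a few values of α and h gives a crude numerical probe of kernel evolution in this toy model; the scaling claims of the abstract (α* ∼ 1/√h, t1 ∼ √h α, ‖ΔΘ‖ ∼ (√h α)^{−a}) are established in the paper through controlled experiments and scaling arguments, not by this sketch.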
Authors – Geiger, Mario (École Polytechnique Fédérale de Lausanne, Route Cantonale, 1015 Lausanne, Switzerland)
– Spigler, Stefano (École Polytechnique Fédérale de Lausanne, Route Cantonale, 1015 Lausanne, Switzerland)
– Jacot, Arthur (École Polytechnique Fédérale de Lausanne, Route Cantonale, 1015 Lausanne, Switzerland)
– Wyart, Matthieu (École Polytechnique Fédérale de Lausanne, Route Cantonale, 1015 Lausanne, Switzerland; email: matthieu.wyart@epfl.ch)
CODEN JSMTC6
ContentType Journal Article
Copyright 2020 IOP Publishing Ltd and SISSA Medialab srl
DOI 10.1088/1742-5468/abc4de
Discipline Physics
EISSN 1742-5468
ISSN 1742-5468
IsDoiOpenAccess false
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
Issue 11
Language English
OpenAccessLink http://infoscience.epfl.ch/record/282180
PageCount 27
PublicationDate 2020-11-01
PublicationTitle Journal of statistical mechanics
PublicationTitleAbbrev JSTAT
PublicationTitleAlternate J. Stat. Mech
PublicationYear 2020
Publisher IOP Publishing and SISSA
StartPage 113301
SubjectTerms deep learning
machine learning
Title Disentangling feature and lazy training in deep neural networks
URI https://iopscience.iop.org/article/10.1088/1742-5468/abc4de
Volume 2020