Disentangling feature and lazy training in deep neural networks
Published in | Journal of statistical mechanics Vol. 2020; no. 11; pp. 113301 - 113327 |
---|---|
Main Authors | Geiger, Mario; Spigler, Stefano; Jacot, Arthur; Wyart, Matthieu |
Format | Journal Article |
Language | English |
Published | IOP Publishing and SISSA, 01.11.2020 |
Subjects | deep learning; machine learning |
Abstract | Two distinct limits for deep learning have been derived as the network width h → ∞, depending on how the weights of the last layer scale with h. In the neural tangent kernel (NTK) limit, the dynamics becomes linear in the weights and is described by a frozen kernel Θ (the NTK). By contrast, in the mean-field limit, the dynamics can be expressed in terms of the distribution of the parameters associated with a neuron, that follows a partial differential equation. In this work we consider deep networks where the weights in the last layer scale as αh^(−1/2) at initialization. By varying α and h, we probe the crossover between the two limits. We observe the two previously identified regimes of 'lazy training' and 'feature training'. In the lazy-training regime, the dynamics is almost linear and the NTK barely changes after initialization. The feature-training regime includes the mean-field formulation as a limiting case and is characterized by a kernel that evolves in time, and thus learns some features. We perform numerical experiments on MNIST, Fashion-MNIST, EMNIST and CIFAR10 and consider various architectures. We find that: (i) the two regimes are separated by an α* that scales as 1/√h. (ii) Network architecture and data structure play an important role in determining which regime is better: in our tests, fully-connected networks perform generally better in the lazy-training regime, unlike convolutional networks. (iii) In both regimes, the fluctuations δF induced on the learned function by initial conditions decay as δF ∼ 1/√h, leading to a performance that increases with h. The same improvement can also be obtained at an intermediate width by ensemble-averaging several networks that are trained independently. (iv) In the feature-training regime we identify a time scale t1 ∼ √h α, such that for t ≪ t1 the dynamics is linear. At t ∼ t1, the output has grown by a magnitude √h and the changes of the tangent kernel ||ΔΘ|| become significant. Ultimately, it follows ||ΔΘ|| ∼ (√h α)^(−a) for ReLU and Softplus activation functions, with a < 2 and a → 2 as depth grows. We provide scaling arguments supporting these findings. |
---|---|
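The abstract above is built around a single diagnostic: how much the empirical tangent kernel Θ moves during training when the last-layer weights are initialized at scale αh^(−1/2). The Python sketch below is not taken from the paper; the one-hidden-layer ReLU network, the toy regression task, the plain full-batch gradient-descent loop, and all widths and step counts are illustrative assumptions. It only shows how the quantities named in the abstract can be instantiated: the α/√h readout initialization, the empirical NTK computed from parameter gradients, and the relative change ||ΔΘ||/||Θ0|| used to separate lazy from feature training.

```python
import numpy as np

# Minimal, self-contained sketch (NOT the paper's experimental setup): a one-hidden-layer
# ReLU network whose last-layer weights are initialized at scale alpha / sqrt(h), trained
# by full-batch gradient descent on a toy regression task.  It shows how to measure the
# relative change of the empirical tangent kernel, ||dTheta|| / ||Theta_0||, the quantity
# the abstract uses to distinguish lazy training (kernel nearly frozen) from feature
# training (kernel evolves).  Widths, step counts and the task are illustrative choices.

rng = np.random.default_rng(0)


def init_params(d, h, alpha):
    W = rng.standard_normal((h, d)) / np.sqrt(d)       # hidden weights, O(1) preactivations
    a = alpha * rng.standard_normal(h) / np.sqrt(h)    # readout at scale alpha / sqrt(h)
    return W, a


def forward(W, a, X):
    return np.maximum(X @ W.T, 0.0) @ a                # f(x) = sum_i a_i ReLU(w_i . x)


def tangent_kernel(W, a, X):
    """Empirical NTK: Theta(x, x') = sum over parameters of df(x)/dp * df(x')/dp."""
    pre = X @ W.T                                      # (n, h) preactivations
    phi = np.maximum(pre, 0.0)                         # df/da_i
    gate = (pre > 0.0).astype(float) * a               # df/dW_i carries a_i * 1[pre > 0]
    return phi @ phi.T + (gate @ gate.T) * (X @ X.T)


def relative_kernel_change(d=5, h=512, alpha=1.0, n=64, steps=500):
    X = rng.standard_normal((n, d))
    y = np.sign(X[:, 0])                               # toy binary target
    W, a = init_params(d, h, alpha)
    theta0 = tangent_kernel(W, a, X)
    lr = 0.5 * n / np.linalg.eigvalsh(theta0)[-1]      # stable step for the linearized dynamics
    for _ in range(steps):
        pre = X @ W.T
        phi = np.maximum(pre, 0.0)
        err = phi @ a - y                              # square-loss residual
        grad_a = phi.T @ err / n
        grad_W = ((err[:, None] * (pre > 0.0)) * a).T @ X / n
        a -= lr * grad_a
        W -= lr * grad_W
    d_theta = tangent_kernel(W, a, X) - theta0
    return np.linalg.norm(d_theta) / np.linalg.norm(theta0)


# Per the paper's picture, the relative kernel change should shrink as sqrt(h) * alpha grows
# (lazy regime) and become of order one in the feature regime; this toy only illustrates
# how the measurement is done, not the reported scaling itself.
for alpha in (0.1, 1.0, 10.0):
    print(f"alpha = {alpha:5.1f}   ||dTheta|| / ||Theta_0|| = {relative_kernel_change(alpha=alpha):.2e}")
```

The step size is tied to the largest eigenvalue of Θ0 so the linearized dynamics stays stable for any α; the paper's experiments use deep architectures and different losses, so the numbers printed by this toy are not expected to reproduce the exponents quoted in the abstract.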
Author | Jacot, Arthur; Geiger, Mario; Wyart, Matthieu; Spigler, Stefano |
Author_xml | – sequence: 1 givenname: Mario surname: Geiger fullname: Geiger, Mario organization: École Polytechnique Fédérale de Lausanne, Route Cantonale, 1015 Lausanne, Switzerland – sequence: 2 givenname: Stefano surname: Spigler fullname: Spigler, Stefano organization: École Polytechnique Fédérale de Lausanne, Route Cantonale, 1015 Lausanne, Switzerland – sequence: 3 givenname: Arthur surname: Jacot fullname: Jacot, Arthur organization: École Polytechnique Fédérale de Lausanne, Route Cantonale, 1015 Lausanne, Switzerland – sequence: 4 givenname: Matthieu surname: Wyart fullname: Wyart, Matthieu email: matthieu.wyart@epfl.ch organization: École Polytechnique Fédérale de Lausanne, Route Cantonale, 1015 Lausanne, Switzerland |
CODEN | JSMTC6 |
CitedBy_id | crossref_primary_10_1073_pnas_2316301121 crossref_primary_10_1002_cpa_22200 crossref_primary_10_1088_1742_5468_ad642b crossref_primary_10_1038_s41467_024_55229_3 crossref_primary_10_1103_PhysRevApplied_21_064027 crossref_primary_10_1063_5_0147231 crossref_primary_10_1088_1742_5468_ad01b9 crossref_primary_10_1088_1742_5468_abf1f3 crossref_primary_10_1103_PhysRevE_105_064118 crossref_primary_10_1073_pnas_2311805121 crossref_primary_10_7554_eLife_79908 crossref_primary_10_1088_2632_2153_ac4f3f crossref_primary_10_1016_j_neunet_2024_106179 crossref_primary_10_1103_PhysRevE_105_044306 crossref_primary_10_1103_PhysRevResearch_4_013201 crossref_primary_10_1088_1742_5468_aceb4f crossref_primary_10_7554_eLife_93060 crossref_primary_10_1088_1742_5468_ac98ac crossref_primary_10_1088_1742_5468_ad01b0 crossref_primary_10_1007_s13735_023_00318_0 crossref_primary_10_7554_eLife_93060_3 crossref_primary_10_1088_1742_5468_ad292a crossref_primary_10_1016_j_physrep_2021_04_001 crossref_primary_10_1038_s42256_023_00772_9 crossref_primary_10_1016_j_physa_2022_128152 |
Cites_doi | 10.5244/C.30.87 |
ContentType | Journal Article |
Copyright | 2020 IOP Publishing Ltd and SISSA Medialab srl |
Copyright_xml | – notice: 2020 IOP Publishing Ltd and SISSA Medialab srl |
DBID | AAYXX CITATION |
DOI | 10.1088/1742-5468/abc4de |
DatabaseName | CrossRef |
DatabaseTitle | CrossRef |
DatabaseTitleList | CrossRef |
DeliveryMethod | fulltext_linktorsrc |
Discipline | Physics |
DocumentTitleAlternate | Disentangling feature and lazy training in deep neural networks |
EISSN | 1742-5468 |
ExternalDocumentID | 10_1088_1742_5468_abc4de jstatabc4de |
IEDL.DBID | IOP |
ISSN | 1742-5468 |
IngestDate | Thu Apr 24 22:52:53 EDT 2025 Tue Jul 01 03:22:31 EDT 2025 Wed Aug 21 03:38:16 EDT 2024 Thu Jan 07 14:56:12 EST 2021 |
IsDoiOpenAccess | false |
IsOpenAccess | true |
IsPeerReviewed | true |
IsScholarly | true |
Issue | 11 |
Language | English |
LinkModel | DirectLink |
Notes | JSTAT_003P_0620 |
OpenAccessLink | http://infoscience.epfl.ch/record/282180 |
PageCount | 27 |
ParticipantIDs | crossref_primary_10_1088_1742_5468_abc4de crossref_citationtrail_10_1088_1742_5468_abc4de iop_journals_10_1088_1742_5468_abc4de |
ProviderPackageCode | CITATION AAYXX |
PublicationCentury | 2000 |
PublicationDate | 2020-11-01 |
PublicationDateYYYYMMDD | 2020-11-01 |
PublicationDate_xml | – month: 11 year: 2020 text: 2020-11-01 day: 01 |
PublicationDecade | 2020 |
PublicationTitle | Journal of statistical mechanics |
PublicationTitleAbbrev | JSTAT |
PublicationTitleAlternate | J. Stat. Mech |
PublicationYear | 2020 |
Publisher | IOP Publishing and SISSA |
Publisher_xml | – name: IOP Publishing and SISSA |
References | Neal (jstatabc4debib23) 2019 Jacot (jstatabc4debib4) 2018; NIPS’18 Dyer (jstatabc4debib11) 2019 Sirignano (jstatabc4debib28) 2018 Paccolat (jstatabc4debib16) 2020 Zagoruyko (jstatabc4debib34) 2016 Lee (jstatabc4debib18) 2018 Chizat (jstatabc4debib10) 2019 Geiger (jstatabc4debib14) 2019 Nguyen (jstatabc4debib21) 2019 Du (jstatabc4debib12) 2019 Neyshabur (jstatabc4debib24) 2017 Yang (jstatabc4debib33) 2019 Han (jstatabc4debib15) 2017 Spigler (jstatabc4debib29) 2018 Matthews (jstatabc4debib20) 2018 Neal (jstatabc4debib22) 1996 Song (jstatabc4debib31) 2019 Jacot (jstatabc4debib5) 2019 Bansal (jstatabc4debib6) 2018 Baity-Jesi (jstatabc4debib7) 2018; vol 80 Novak (jstatabc4debib25) 2019 Chizat (jstatabc4debib8) 2018; vol 31 Chizat (jstatabc4debib9) 2019 Lee (jstatabc4debib19) 2019 Geiger (jstatabc4debib13) 2018 Rotskoff (jstatabc4debib27) 2018 Advani (jstatabc4debib1) 2017 Song (jstatabc4debib30) 2018 Kingma (jstatabc4debib17) 2015 Arora (jstatabc4debib3) 2019 Allen-Zhu (jstatabc4debib2) 2018 Park (jstatabc4debib26) 2019 Williams (jstatabc4debib32) 1997 |
References_xml | – volume: vol 31 start-page: 3040 year: 2018 ident: jstatabc4debib8 article-title: On the global convergence of gradient descent for over-parameterized models using optimal transport – year: 2019 ident: jstatabc4debib26 article-title: The effect of network width on stochastic gradient descent and generalization: an empirical study – volume: NIPS’18 start-page: 8580 year: 2018 ident: jstatabc4debib4 article-title: Neural tangent kernel: convergence and generalization in neural networks – volume: vol 80 start-page: 314 year: 2018 ident: jstatabc4debib7 article-title: Comparing dynamics: deep neural networks versus glassy systems – year: 2019 ident: jstatabc4debib3 article-title: On exact computation with an infinitely wide neural ne – year: 2020 ident: jstatabc4debib16 article-title: Geometric compression of invariant manifolds in neural nets – year: 2018 ident: jstatabc4debib28 article-title: Mean field analysis of neural networks – year: 2017 ident: jstatabc4debib24 article-title: Geometry of optimization and implicit regularization in deep learning – year: 2019 ident: jstatabc4debib14 article-title: Scaling description of generalization with number of parameters in deep learning – year: 2019 ident: jstatabc4debib11 article-title: Asymptotics of wide networks from Feynman diagrams – year: 2019 ident: jstatabc4debib12 article-title: Gradient descent provably optimizes over-parameterized neural networks – year: 2018 ident: jstatabc4debib18 article-title: Deep neural networks as Gaussian processes – year: 2019 ident: jstatabc4debib23 article-title: A modern take on the bias-variance tradeoff in neural networks – year: 2016 ident: jstatabc4debib34 article-title: Wide residual networks doi: 10.5244/C.30.87 – year: 2019 ident: jstatabc4debib21 article-title: Mean field limit of the learning dynamics of multilayer neural networks – year: 1996 ident: jstatabc4debib22 – year: 2018 ident: jstatabc4debib13 article-title: The jamming transition as a paradigm to understand the loss landscape of deep neural networks – year: 2015 ident: jstatabc4debib17 article-title: Adam: a method for stochastic optimization – year: 2019 ident: jstatabc4debib33 article-title: Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation – year: 2017 ident: jstatabc4debib15 article-title: Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms – year: 2019 ident: jstatabc4debib5 article-title: The asymptotic spectrum of the Hessian of DNN throughout training – year: 2018 ident: jstatabc4debib20 article-title: Gaussian process behaviour in wide deep neural networks – year: 2018 ident: jstatabc4debib27 article-title: Neural networks as interacting particle systems: asymptotic convexity of the loss landscape and universal scaling of the approximation error – year: 2017 ident: jstatabc4debib1 article-title: High-dimensional dynamics of generalization error in neural networks – year: 2019 ident: jstatabc4debib19 article-title: Wide neural networks of any depth evolve as linear models under gradient descent – year: 2019 ident: jstatabc4debib25 article-title: Bayesian deep convolutional networks with many channels are Gaussian processes – year: 2018 ident: jstatabc4debib29 article-title: A jamming transition from under-to over-parametrization affects loss landscape and generalization – year: 2018 ident: jstatabc4debib2 article-title: A convergence theory for deep learning via over-parameterization – 
year: 2019 ident: jstatabc4debib10 article-title: On lazy training in differentiable programming – year: 2019 ident: jstatabc4debib9 article-title: A note on lazy training in supervised differentiable programming – start-page: 295 year: 1997 ident: jstatabc4debib32 article-title: Computing with infinite networks – year: 2018 ident: jstatabc4debib6 article-title: Minnorm training: an algorithm for training overcomplete deep neural networks – year: 2019 ident: jstatabc4debib31 article-title: Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit – year: 2018 ident: jstatabc4debib30 article-title: A mean field view of the landscape of two-layers neural networks |
SSID | ssj0032048 |
Score | 2.5285904 |
Snippet | Two distinct limits for deep learning have been derived as the network width h → ∞, depending on how the weights of the last layer scale with h. In the neural... |
SourceID | crossref iop |
SourceType | Enrichment Source Index Database Publisher |
StartPage | 113301 |
SubjectTerms | deep learning machine learning |
Title | Disentangling feature and lazy training in deep neural networks |
URI | https://iopscience.iop.org/article/10.1088/1742-5468/abc4de |
Volume | 2020 |
hasFullText | 1 |
inHoldings | 1 |
isFullTextHit | |
isPrint | |
link | http://utb.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMwnV1LS8NAEF5qRfDiW6wv9qAHD2mb3STu4kFELVXwcbDQgxD2MZFqSYNND_bXu5tNiopIEXLIYWazDNnZGWa-bxA64kRpInXogS-EFwignhTAvRBoyCxjmy8t3vnuPur2gtt-2K-hsxkWZpSVrr9pXh1RsDNh2RDHWiaGtvT9EWsJqQINC2iRsiiy4wtuHh4rN0wtI21Zl_xN69s9tGC-9eVa6ayi52pDrpvkrTnJZVNNf3A1_nPHa2ilDDfxhRNdRzVIN9BS0fapxpvo_GpQgI8sljd9wQkUNJ9YpBoPxfQDVwMk8CDFGiDDlv7SrJe65vHxFup1rp8uu145UsFTNOS5J3iilGTaF5EyiZcgia2s8cQXmoGJNBThDICokLQjzomUEIa67UsF0iRibaDbqJ6OUthB2Jxd4x6UlkCDINCMCQikVMkptYQwIBuoVRk4ViXfuN31MC7q3ozF1iyxNUvszNJAJzONzHFt_CF7bKwdlwduPLfcq8VpxcQEySb3MQ81Di7OdLI753p7aNkqOzTiPqrn7xM4MGFJLg-L3-8TWOPeiA |
linkProvider | IOP Publishing |
linkToPdf | http://utb.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMwpV1LT8MwDI7YEIgLb8R45gAHDl3XpinpCSHGtPEYOzBpt5CHiwZTV7HtwH49SdshQAhNQuqhB8e1nMaxZfszQieRr7QvNXXAE8IJBBBHCogcCoQyi9jmSdvvfN8Om93gpkd7xZzTrBdmmBamv2pec6DgXIVFQRxzjQ9t4ftD5gqpAg1uquMSWqQkJBY8v_XQmZliYlFpi9zkbyu_3UUl870vV0tjDT3NhMorSl6rk7GsqukPvMZ_SL2OVgu3E1_m5BtoAZJNtJSVf6rRFrqo97MmJNvTmzzjGDK4TywSjQdi-o5ngyRwP8EaIMUWBtPwS_Ii8tE26jauH6-aTjFawVGERmNHRLFSkmlPhMoEYMKPbYYtij2hGRiPQ_kRA_AV9WthFPlSAqW65kkF0gRkNSA7qJwME9hF2JxhYyaUlkCCINCMCQikVPE5scAwICvInSmZqwJ33Eo94Fn-mzFuVcOtaniumgo6-1yR5pgbf9CeGo3z4uCN5qZ7sf1a3DfOsomBzEOMoeNmS_bm5HeMljv1Br9rtW_30YrlkzcoHqDy-G0Ch8ZTGcuj7G_8AC0d4-w |
openUrl | ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&rft.genre=article&rft.atitle=Disentangling+feature+and+lazy+training+in+deep+neural+networks&rft.jtitle=Journal+of+statistical+mechanics&rft.au=Geiger%2C+Mario&rft.au=Spigler%2C+Stefano&rft.au=Jacot%2C+Arthur&rft.au=Wyart%2C+Matthieu&rft.date=2020-11-01&rft.issn=1742-5468&rft.eissn=1742-5468&rft.volume=2020&rft.issue=11&rft.spage=113301&rft_id=info:doi/10.1088%2F1742-5468%2Fabc4de&rft.externalDBID=n%2Fa&rft.externalDocID=10_1088_1742_5468_abc4de |
thumbnail_l | http://covers-cdn.summon.serialssolutions.com/index.aspx?isbn=/lc.gif&issn=1742-5468&client=summon |
thumbnail_m | http://covers-cdn.summon.serialssolutions.com/index.aspx?isbn=/mc.gif&issn=1742-5468&client=summon |
thumbnail_s | http://covers-cdn.summon.serialssolutions.com/index.aspx?isbn=/sc.gif&issn=1742-5468&client=summon |