The Limiting Dynamics of SGD: Modified Loss, Phase-Space Oscillations, and Anomalous Diffusion

Bibliographic Details
Published in: Neural Computation, Vol. 36, No. 1, p. 151
Main Authors: Kunin, Daniel; Sagastuy-Brena, Javier; Gillespie, Lauren; Margalit, Eshed; Tanaka, Hidenori; Ganguli, Surya; Yamins, Daniel L. K.
Format: Journal Article
Language: English
Published: United States, 12 December 2023
DOI: 10.1162/neco_a_01626
EISSN: 1530-888X
PMID: 38052080
Copyright: 2023 Massachusetts Institute of Technology.
Online Access: https://www.ncbi.nlm.nih.gov/pubmed/38052080
Abstract
In this work, we explore the limiting dynamics of deep neural networks trained with stochastic gradient descent (SGD). As observed previously, long after performance has converged, networks continue to move through parameter space by a process of anomalous diffusion in which distance traveled grows as a power law in the number of gradient updates with a nontrivial exponent. We reveal an intricate interaction among the hyperparameters of optimization, the structure in the gradient noise, and the Hessian matrix at the end of training that explains this anomalous diffusion. To build this understanding, we first derive a continuous-time model for SGD with finite learning rates and batch sizes as an underdamped Langevin equation. We study this equation in the setting of linear regression, where we can derive exact, analytic expressions for the phase-space dynamics of the parameters and their instantaneous velocities from initialization to stationarity. Using the Fokker-Planck equation, we show that the key ingredient driving these dynamics is not the original training loss but rather the combination of a modified loss, which implicitly regularizes the velocity, and probability currents that cause oscillations in phase space. We identify qualitative and quantitative predictions of this theory in the dynamics of a ResNet-18 model trained on ImageNet. Through the lens of statistical physics, we uncover a mechanistic origin for the anomalous limiting dynamics of deep neural networks trained with SGD. Understanding the limiting dynamics of SGD, and its dependence on various important hyperparameters like batch size, learning rate, and momentum, can serve as a basis for future work that can turn these insights into algorithmic gains.
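The abstract's central modeling step, approximating SGD with momentum by an underdamped Langevin equation, can be pictured with the generic form below. This is a schematic for orientation only, not the paper's exact equation: the effective mass m, damping gamma, and diffusion matrix D are left symbolic here, whereas the paper ties such coefficients to the learning rate, momentum, batch size, and the covariance of the gradient noise.

\[
d\theta_t = v_t \, dt, \qquad
m \, dv_t = -\gamma\, v_t \, dt - \nabla L(\theta_t) \, dt + \sqrt{2D} \, dW_t ,
\]

where \(\theta_t\) are the parameters, \(v_t\) their instantaneous velocities, \(L\) the training loss, and \(W_t\) a Wiener process standing in for minibatch gradient noise.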
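The anomalous-diffusion claim, that distance traveled grows as a power law in the number of gradient updates, suggests a simple diagnostic one can run on saved checkpoints. The sketch below is a hypothetical illustration, not code from the paper: it measures the displacement ||theta_t - theta_0|| along a trajectory and fits the power-law exponent by linear regression in log-log space.

import numpy as np

def diffusion_exponent(checkpoints):
    """Estimate the exponent c in ||theta_t - theta_0|| ~ t^c.

    checkpoints: list of 1-D numpy arrays, the flattened parameters saved
    at evenly spaced gradient-update steps, starting from the reference point.
    """
    theta0 = checkpoints[0]
    steps = np.arange(1, len(checkpoints))
    # Euclidean distance traveled from the start of the measurement window.
    dists = np.array([np.linalg.norm(c - theta0) for c in checkpoints[1:]])
    # Straight-line fit in log-log space; the slope is the diffusion exponent.
    slope, _ = np.polyfit(np.log(steps), np.log(dists), 1)
    return slope

# Toy usage with a synthetic random walk, where the exponent should land
# near 0.5 (ordinary diffusion); the paper reports nontrivial exponents for SGD.
rng = np.random.default_rng(0)
walk = np.cumsum(rng.standard_normal((1000, 50)), axis=0)
trajectory = [np.zeros(50)] + [w for w in walk]
print(f"estimated exponent: {diffusion_exponent(trajectory):.2f}")

For an ordinary random walk this fit returns an exponent close to 0.5; the paper's observation is that end-of-training SGD trajectories deviate from this value in a way governed by the hyperparameters, the gradient-noise structure, and the Hessian.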
Authors and Affiliations
1. Daniel Kunin, Stanford University, Stanford, CA 94305, U.S.A. (kunin@stanford.edu)
2. Javier Sagastuy-Brena, Stanford University, Stanford, CA 94305, U.S.A. (jvrsgsty@stanford.edu)
3. Lauren Gillespie, Stanford University, Stanford, CA 94305, U.S.A. (gillespl@stanford.edu)
4. Eshed Margalit, Stanford University, Stanford, CA 94305, U.S.A. (eshedm@stanford.edu)
5. Hidenori Tanaka, NTT Research, Sunnyvale, CA 94085, U.S.A. (hidenori.tanaka@ntt-research.com)
6. Surya Ganguli, Facebook AI Research, Menlo Park, CA 94025, U.S.A. (sganguli@stanford.edu)
7. Daniel L. K. Yamins, Stanford University, Stanford, CA 94305, U.S.A. (yamins@stanford.edu)