Beyond the Policy Gradient Theorem for Efficient Policy Updates in Actor-Critic Algorithms

Bibliographic Details
Main Authors Laroche, Romain; Tachet, Remi
Format Journal Article
Language English
Published 15.02.2022
Online Access Get full text

Abstract In Reinforcement Learning, the optimal action at a given state is dependent on policy decisions at subsequent states. As a consequence, the learning targets evolve with time and the policy optimization process must be efficient at unlearning what it previously learnt. In this paper, we discover that the policy gradient theorem prescribes policy updates that are slow to unlearn because of their structural symmetry with respect to the value target. To increase the unlearning speed, we study a novel policy update: the gradient of the cross-entropy loss with respect to the action maximizing $q$, but find that such updates may lead to a decrease in value. Consequently, we introduce a modified policy update devoid of that flaw, and prove its guarantees of convergence to global optimality in $\mathcal{O}(t^{-1})$ under classic assumptions. Further, we assess standard policy updates and our cross-entropy policy updates along six analytical dimensions. Finally, we empirically validate our theoretical findings.
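
As a minimal sketch of the contrast the abstract draws, assuming a single state, a tabular softmax policy, and toy q-values (the function names, learning rate, and numbers below are illustrative and are not the paper's own implementation, nor its modified update with the O(t^-1) guarantee):

import numpy as np

def softmax(logits):
    z = logits - logits.max()          # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum()

def pg_update(theta, q, lr=0.1):
    # Vanilla policy-gradient step on softmax logits theta at one state:
    # the gradient of E_{a~pi}[q(a)] w.r.t. theta is pi * (q - pi.q), so each
    # component is damped by pi(a) -- small once the policy has committed.
    pi = softmax(theta)
    return theta + lr * pi * (q - pi @ q)

def ce_update(theta, q, lr=0.1):
    # Cross-entropy step toward the greedy action argmax_a q(a):
    # descending -log pi(a*) gives the direction onehot(a*) - pi,
    # which stays large even when pi(a*) is close to zero.
    pi = softmax(theta)
    onehot = np.eye(len(q))[np.argmax(q)]
    return theta + lr * (onehot - pi)

# Toy unlearning scenario: the policy has committed to action 0, but the
# value target has changed so that action 1 is now the best one.
theta = np.array([5.0, 0.0, 0.0])
q = np.array([0.0, 1.0, 0.0])
print(softmax(pg_update(theta, q)))   # barely moves: gradient scaled by pi(1) ~ 0.007
print(softmax(ce_update(theta, q)))   # shifts mass toward action 1 much faster

This asymmetry is the point the abstract makes: the policy-gradient direction is weighted by pi(a), so unlearning a near-deterministic commitment is slow, whereas the plain cross-entropy direction toward argmax q is not (though, as the abstract notes, the latter can decrease the value, which motivates the paper's modified update).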
Author Tachet, Remi
Laroche, Romain
Author_xml – sequence: 1
  givenname: Romain
  surname: Laroche
  fullname: Laroche, Romain
– sequence: 2
  givenname: Remi
  surname: Tachet
  fullname: Tachet, Remi
BackLink https://doi.org/10.48550/arXiv.2202.07496 (View paper in arXiv)
ContentType Journal Article
Copyright http://arxiv.org/licenses/nonexclusive-distrib/1.0
Copyright_xml – notice: http://arxiv.org/licenses/nonexclusive-distrib/1.0
DBID AKY
AKZ
EPD
GOX
DOI 10.48550/arxiv.2202.07496
DatabaseName arXiv Computer Science
arXiv Mathematics
arXiv Statistics
arXiv.org
Database_xml – sequence: 1
  dbid: GOX
  name: arXiv.org
  url: http://arxiv.org/find
  sourceTypes: Open Access Repository
DeliveryMethod fulltext_linktorsrc
ExternalDocumentID 2202_07496
GroupedDBID AKY
AKZ
EPD
GOX
IEDL.DBID GOX
IngestDate Mon Jan 08 05:49:40 EST 2024
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed false
IsScholarly false
Language English
LinkModel DirectLink
OpenAccessLink https://arxiv.org/abs/2202.07496
ParticipantIDs arxiv_primary_2202_07496
PublicationCentury 2000
PublicationDate 2022-02-15
PublicationDateYYYYMMDD 2022-02-15
PublicationDate_xml – month: 02
  year: 2022
  text: 2022-02-15
  day: 15
PublicationDecade 2020
PublicationYear 2022
SecondaryResourceType preprint
SourceID arxiv
SourceType Open Access Repository
SubjectTerms Computer Science - Artificial Intelligence
Computer Science - Learning
Mathematics - Optimization and Control
Statistics - Machine Learning
Title Beyond the Policy Gradient Theorem for Efficient Policy Updates in Actor-Critic Algorithms
URI https://arxiv.org/abs/2202.07496
hasFullText 1
inHoldings 1
linkProvider Cornell University