Beyond the Policy Gradient Theorem for Efficient Policy Updates in Actor-Critic Algorithms

Bibliographic Details
Main Authors Laroche, Romain; Tachet, Remi
Format Journal Article
Language English
Published 15.02.2022
Online Access Get full text

Abstract In Reinforcement Learning, the optimal action at a given state is dependent on policy decisions at subsequent states. As a consequence, the learning targets evolve with time and the policy optimization process must be efficient at unlearning what it previously learnt. In this paper, we discover that the policy gradient theorem prescribes policy updates that are slow to unlearn because of their structural symmetry with respect to the value target. To increase the unlearning speed, we study a novel policy update: the gradient of the cross-entropy loss with respect to the action maximizing $q$, but find that such updates may lead to a decrease in value. Consequently, we introduce a modified policy update devoid of that flaw, and prove its guarantees of convergence to global optimality in $\mathcal{O}(t^{-1})$ under classic assumptions. Further, we assess standard policy updates and our cross-entropy policy updates along six analytical dimensions. Finally, we empirically validate our theoretical findings.
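
As a minimal sketch of the contrast the abstract draws, assuming a single state, a tabular softmax policy, and toy q-values (the function names, learning rate, and numbers below are illustrative and are not the paper's own implementation, nor its modified update with the O(t^-1) guarantee):

import numpy as np

def softmax(logits):
    z = logits - logits.max()          # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum()

def pg_update(theta, q, lr=0.1):
    # Vanilla policy-gradient step on softmax logits theta at one state:
    # the gradient of E_{a~pi}[q(a)] w.r.t. theta is pi * (q - pi.q), so each
    # component is damped by pi(a) -- small once the policy has committed.
    pi = softmax(theta)
    return theta + lr * pi * (q - pi @ q)

def ce_update(theta, q, lr=0.1):
    # Cross-entropy step toward the greedy action argmax_a q(a):
    # descending -log pi(a*) gives the direction onehot(a*) - pi,
    # which stays large even when pi(a*) is close to zero.
    pi = softmax(theta)
    onehot = np.eye(len(q))[np.argmax(q)]
    return theta + lr * (onehot - pi)

# Toy unlearning scenario: the policy has committed to action 0, but the
# value target has changed so that action 1 is now the best one.
theta = np.array([5.0, 0.0, 0.0])
q = np.array([0.0, 1.0, 0.0])
print(softmax(pg_update(theta, q)))   # barely moves: gradient scaled by pi(1) ~ 0.007
print(softmax(ce_update(theta, q)))   # shifts mass toward action 1 much faster

This asymmetry is the point the abstract makes: the policy-gradient direction is weighted by pi(a), so unlearning a near-deterministic commitment is slow, whereas the plain cross-entropy direction toward argmax q is not (though, as the abstract notes, the latter can decrease the value, which motivates the paper's modified update).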
Author Tachet, Remi
Laroche, Romain
Author_xml – sequence: 1
  givenname: Romain
  surname: Laroche
  fullname: Laroche, Romain
– sequence: 2
  givenname: Remi
  surname: Tachet
  fullname: Tachet, Remi
BackLink https://doi.org/10.48550/arXiv.2202.07496 (View paper in arXiv)
ContentType Journal Article
Copyright http://arxiv.org/licenses/nonexclusive-distrib/1.0
Copyright_xml – notice: http://arxiv.org/licenses/nonexclusive-distrib/1.0
DBID AKY
AKZ
EPD
GOX
DOI 10.48550/arxiv.2202.07496
DatabaseName arXiv Computer Science
arXiv Mathematics
arXiv Statistics
arXiv.org
Database_xml – sequence: 1
  dbid: GOX
  name: arXiv.org
  url: http://arxiv.org/find
  sourceTypes: Open Access Repository
DeliveryMethod fulltext_linktorsrc
ExternalDocumentID 2202_07496
GroupedDBID AKY
AKZ
EPD
GOX
IEDL.DBID GOX
IngestDate Mon Jan 08 05:49:40 EST 2024
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed false
IsScholarly false
Language English
LinkModel DirectLink
OpenAccessLink https://arxiv.org/abs/2202.07496
ParticipantIDs arxiv_primary_2202_07496
PublicationCentury 2000
PublicationDate 2022-02-15
PublicationDateYYYYMMDD 2022-02-15
PublicationDate_xml – month: 02
  year: 2022
  text: 2022-02-15
  day: 15
PublicationDecade 2020
PublicationYear 2022
SecondaryResourceType preprint
SourceID arxiv
SourceType Open Access Repository
SubjectTerms Computer Science - Artificial Intelligence
Computer Science - Learning
Mathematics - Optimization and Control
Statistics - Machine Learning
Title Beyond the Policy Gradient Theorem for Efficient Policy Updates in Actor-Critic Algorithms
URI https://arxiv.org/abs/2202.07496
hasFullText 1
inHoldings 1
linkProvider Cornell University