Convergence Guarantees of Model-free Policy Gradient Methods for LQR with Stochastic Data

Bibliographic Details
Main Authors Song, Bowen; Iannelli, Andrea
Format Journal Article
Language English
Published 27.02.2025
Subjects Computer Science - Systems and Control
Online Access https://arxiv.org/abs/2502.19977
DOI 10.48550/arxiv.2502.19977
Copyright http://creativecommons.org/licenses/by/4.0

Abstract Policy gradient (PG) methods are the backbone of many reinforcement learning algorithms due to their good performance in policy optimization problems. As a gradient-based approach, PG methods typically rely on knowledge of the system dynamics. However, this information is not always available, and in such cases, trajectory data can be utilized to approximate first-order information. When the data are noisy, gradient estimates become inaccurate and a formal investigation that encompasses uncertainty estimation and the analysis of its propagation through the algorithm is currently missing. To address this, our work focuses on the Linear Quadratic Regulator (LQR) problem for systems subject to additive stochastic noise. After briefly summarizing the state of the art for cases with a known model, we focus on scenarios where the system dynamics are unknown, and approximate gradient information is obtained using zeroth-order optimization techniques. We analyze the theoretical properties by computing the error in the estimated gradient and examining how this error affects the convergence of PG algorithms. Additionally, we provide global convergence guarantees for various versions of PG methods, including those employing adaptive step sizes and variance reduction techniques, which help increase the convergence rate and reduce sample complexity. One contribution of this work is the study of the robustness of model-free PG methods, aiming to identify their limitations in the presence of noise and propose improvements to enhance their applicability. Numerical simulations show that these theoretical analyses provide valuable guidance in tuning the algorithm parameters, thereby making these methods more reliable in practically relevant scenarios.
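
The abstract refers to approximating first-order information from noisy trajectory data using zeroth-order optimization. The Python sketch below illustrates that general idea on a toy LQR instance: a two-point smoothed gradient estimate of the closed-loop cost with respect to a state-feedback gain, followed by plain gradient descent. The system matrices (A, B, Q, R), horizon, smoothing radius, sample count, and step size are illustrative assumptions; this is not the paper's algorithm, which additionally employs adaptive step sizes and variance reduction.

```python
# Minimal sketch (assumptions throughout): model-free zeroth-order policy
# gradient for a noisy LQR problem. All numerical values are placeholders.
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-state, 1-input system (assumed, not from the paper).
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = np.eye(1)

def rollout_cost(K, T=50, noise_std=0.05):
    """Finite-horizon cost of u_t = -K x_t under additive process noise."""
    x = np.array([1.0, 0.0])
    cost = 0.0
    for _ in range(T):
        u = -K @ x
        cost += x @ Q @ x + u @ R @ u
        x = A @ x + B @ u + noise_std * rng.standard_normal(2)
    return cost

def zeroth_order_grad(K, radius=0.1, n_samples=20):
    """Two-point smoothed gradient estimate built from noisy cost rollouts."""
    d = K.size
    grad = np.zeros_like(K)
    for _ in range(n_samples):
        U = rng.standard_normal(K.shape)
        U /= np.linalg.norm(U)  # random direction on the unit sphere
        diff = rollout_cost(K + radius * U) - rollout_cost(K - radius * U)
        grad += (d / (2.0 * radius * n_samples)) * diff * U
    return grad

# Plain model-free policy gradient descent on the feedback gain K.
K = np.array([[0.5, 0.5]])  # stabilizing initial gain (assumption)
for _ in range(200):
    K -= 1e-4 * zeroth_order_grad(K)
print("final gain:", K, "noiseless cost:", rollout_cost(K, noise_std=0.0))
```

The two-point estimator is used here only because it typically has lower variance than a one-point estimate from a single noisy rollout; the step size, smoothing radius, and sample count are exactly the kind of tuning parameters the paper's analysis is meant to guide.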