Convergence Guarantees of Model-free Policy Gradient Methods for LQR with Stochastic Data

Bibliographic Details
Main Authors Song, Bowen; Iannelli, Andrea
Format Journal Article
Language English
Published 27.02.2025
Subjects Computer Science - Systems and Control
Online Access https://arxiv.org/abs/2502.19977
DOI 10.48550/arxiv.2502.19977
Copyright http://creativecommons.org/licenses/by/4.0

Abstract Policy gradient (PG) methods are the backbone of many reinforcement learning algorithms due to their good performance in policy optimization problems. As a gradient-based approach, PG methods typically rely on knowledge of the system dynamics. However, this information is not always available, and in such cases, trajectory data can be utilized to approximate first-order information. When the data are noisy, gradient estimates become inaccurate and a formal investigation that encompasses uncertainty estimation and the analysis of its propagation through the algorithm is currently missing. To address this, our work focuses on the Linear Quadratic Regulator (LQR) problem for systems subject to additive stochastic noise. After briefly summarizing the state of the art for cases with a known model, we focus on scenarios where the system dynamics are unknown, and approximate gradient information is obtained using zeroth-order optimization techniques. We analyze the theoretical properties by computing the error in the estimated gradient and examining how this error affects the convergence of PG algorithms. Additionally, we provide global convergence guarantees for various versions of PG methods, including those employing adaptive step sizes and variance reduction techniques, which help increase the convergence rate and reduce sample complexity. One contribution of this work is the study of the robustness of model-free PG methods, aiming to identify their limitations in the presence of noise and propose improvements to enhance their applicability. Numerical simulations show that these theoretical analyses provide valuable guidance in tuning the algorithm parameters, thereby making these methods more reliable in practically relevant scenarios.
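
The abstract refers to approximating first-order information from noisy trajectory data using zeroth-order optimization. The Python sketch below illustrates that general idea on a toy LQR instance: a two-point smoothed gradient estimate of the closed-loop cost with respect to a state-feedback gain, followed by plain gradient descent. The system matrices (A, B, Q, R), horizon, smoothing radius, sample count, and step size are illustrative assumptions; this is not the paper's algorithm, which additionally employs adaptive step sizes and variance reduction.

```python
# Minimal sketch (assumptions throughout): model-free zeroth-order policy
# gradient for a noisy LQR problem. All numerical values are placeholders.
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-state, 1-input system (assumed, not from the paper).
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = np.eye(1)

def rollout_cost(K, T=50, noise_std=0.05):
    """Finite-horizon cost of u_t = -K x_t under additive process noise."""
    x = np.array([1.0, 0.0])
    cost = 0.0
    for _ in range(T):
        u = -K @ x
        cost += x @ Q @ x + u @ R @ u
        x = A @ x + B @ u + noise_std * rng.standard_normal(2)
    return cost

def zeroth_order_grad(K, radius=0.1, n_samples=20):
    """Two-point smoothed gradient estimate built from noisy cost rollouts."""
    d = K.size
    grad = np.zeros_like(K)
    for _ in range(n_samples):
        U = rng.standard_normal(K.shape)
        U /= np.linalg.norm(U)  # random direction on the unit sphere
        diff = rollout_cost(K + radius * U) - rollout_cost(K - radius * U)
        grad += (d / (2.0 * radius * n_samples)) * diff * U
    return grad

# Plain model-free policy gradient descent on the feedback gain K.
K = np.array([[0.5, 0.5]])  # stabilizing initial gain (assumption)
for _ in range(200):
    K -= 1e-4 * zeroth_order_grad(K)
print("final gain:", K, "noiseless cost:", rollout_cost(K, noise_std=0.0))
```

The two-point estimator is used here only because it typically has lower variance than a one-point estimate from a single noisy rollout; the step size, smoothing radius, and sample count are exactly the kind of tuning parameters the paper's analysis is meant to guide.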