Efficient Reinforcement Learning in Probabilistic Reward Machines
Published in | arXiv.org
---|---
Main Authors | ,
Format | Paper
Language | English
Published | Ithaca: Cornell University Library, arXiv.org, 19.08.2024
Subjects |
Online Access | Get full text
Summary | In this paper, we study reinforcement learning in Markov Decision Processes with Probabilistic Reward Machines (PRMs), a form of non-Markovian reward commonly found in robotics tasks. We design an algorithm for PRMs that achieves a regret bound of \(\widetilde{O}(\sqrt{HOAT} + H^2O^2A^{3/2} + H\sqrt{T})\), where \(H\) is the time horizon, \(O\) is the number of observations, \(A\) is the number of actions, and \(T\) is the number of time-steps. This result improves over the best-known bound, \(\widetilde{O}(H\sqrt{OAT})\), of \citet{pmlr-v206-bourel23a} for MDPs with Deterministic Reward Machines (DRMs), a special case of PRMs. When \(T \geq H^3O^3A^2\) and \(OA \geq H\), our regret bound simplifies to \(\widetilde{O}(\sqrt{HOAT})\), which matches the established lower bound of \(\Omega(\sqrt{HOAT})\) for MDPs with DRMs up to a logarithmic factor. To the best of our knowledge, this is the first efficient algorithm for PRMs. Additionally, we present a new simulation lemma for non-Markovian rewards, which enables reward-free exploration for any non-Markovian reward given access to an approximate planner. Complementing our theoretical findings, we show through extensive experimental evaluations that our algorithm indeed outperforms prior methods in various PRM environments.
ISSN | 2331-8422
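Remark on the regime stated in the summary (a brief sketch using only the symbols \(H\), \(O\), \(A\), \(T\) defined there): the two conditions are exactly those under which the leading term \(\sqrt{HOAT}\) dominates the lower-order terms of the regret bound,

\[
\sqrt{HOAT} \;\ge\; H^{2}O^{2}A^{3/2} \iff HOAT \;\ge\; H^{4}O^{4}A^{3} \iff T \;\ge\; H^{3}O^{3}A^{2},
\qquad
\sqrt{HOAT} \;\ge\; H\sqrt{T} \iff HOAT \;\ge\; H^{2}T \iff OA \;\ge\; H,
\]

so in that regime the bound \(\widetilde{O}(\sqrt{HOAT} + H^2O^2A^{3/2} + H\sqrt{T})\) reduces to \(\widetilde{O}(\sqrt{HOAT})\), matching the \(\Omega(\sqrt{HOAT})\) lower bound up to logarithmic factors.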