Greedy-Step Off-Policy Reinforcement Learning


Bibliographic Details
Main Authors: Wang, Yuhui; Wu, Qingyuan; He, Pengcheng; Tan, Xiaoyang
Format: Journal Article
Language: English
Published: 23.02.2021
DOI: 10.48550/arxiv.2102.11717

More Information
Summary: Most policy evaluation algorithms are based on the theories of the Bellman Expectation and Optimality Equations, which give rise to two popular approaches: Policy Iteration (PI) and Value Iteration (VI). However, multi-step bootstrapping is often at cross-purposes with off-policy learning in PI-based methods, due to the large variance of multi-step off-policy correction. In contrast, VI-based methods are naturally off-policy but limited to one-step learning. In this paper, we deduce a novel multi-step Bellman Optimality Equation by utilizing a latent structure of multi-step bootstrapping with the optimal value function. Via this new equation, we derive a new multi-step value iteration method that converges to the optimal value function with exponential contraction rate $\mathcal{O}(\gamma^n)$ but only linear computational complexity. Moreover, it naturally yields a suite of multi-step off-policy algorithms that can safely utilize data collected by arbitrary policies without correction. Experiments reveal that the proposed methods are reliable, easy to implement, and achieve state-of-the-art performance on a series of standard benchmark datasets.
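To make the multi-step value-iteration idea in the summary concrete, here is a minimal toy sketch. It assumes a simplified reading of the multi-step optimality target in which, at each of up to $n$ unrolled steps, the backup either bootstraps with the current value estimate or continues greedily, taking the maximum. The 5-state chain MDP, the `step` function, and the horizon `n=3` are all hypothetical illustration choices, not the paper's exact operator or benchmarks.

```python
import numpy as np

# Hypothetical 5-state deterministic chain MDP (illustration only):
# action 1 moves right, action 0 moves left; reaching state 4 pays +1,
# and state 4 is absorbing with zero reward thereafter.
N_S, N_A, GAMMA = 5, 2, 0.9

def step(s, a):
    if s == 4:                               # absorbing terminal state
        return 4, 0.0
    s2 = min(s + 1, 4) if a == 1 else max(s - 1, 0)
    return s2, 1.0 if s2 == 4 else 0.0

def multi_step_target(v, s, n):
    """Greedy multi-step optimality target (simplified reading): at each of
    up to n steps, either bootstrap with v or keep unrolling greedily,
    and take the maximum over both choices and all actions."""
    if n == 0:
        return v[s]
    best = -np.inf
    for a in range(N_A):
        s2, r = step(s, a)
        best = max(best, r + GAMMA * max(v[s2], multi_step_target(v, s2, n - 1)))
    return best

# Multi-step value iteration: apply the backup to every state each sweep.
v = np.zeros(N_S)
for _ in range(50):
    v = np.array([multi_step_target(v, s, n=3) for s in range(N_S)])

print(np.round(v, 3))  # → [0.729 0.81  0.9   1.    0.   ]
```

Note the off-policy flavor: the target never importance-weights anything; it only compares bootstrapped values, which is why (per the summary) data from arbitrary policies can be used without correction. On this toy chain the deeper unrolls propagate reward information several states per sweep, whereas a one-step backup moves it by only one state per sweep.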