Robust On-Policy Sampling for Data-Efficient Policy Evaluation in Reinforcement Learning
Format | Journal Article
Language | English
Published | 29.11.2021
Summary: Reinforcement learning (RL) algorithms are often categorized as either on-policy or off-policy depending on whether they use data from a target policy of interest or from a different behavior policy. In this paper, we study a subtle distinction between on-policy data and on-policy sampling in the context of the RL sub-problem of policy evaluation. We observe that on-policy sampling may fail to match the expected distribution of on-policy data after only a finite number of trajectories, and that this mismatch hinders data-efficient policy evaluation. Towards improved data efficiency, we show how non-i.i.d., off-policy sampling can produce data that more closely matches the expected on-policy data distribution and consequently increases the accuracy of the Monte Carlo estimator for policy evaluation. We introduce a method called Robust On-Policy Sampling and demonstrate both theoretically and empirically that it produces data which converges faster to the expected on-policy distribution than data from on-policy sampling. Empirically, we show that this faster convergence yields policy value estimates with lower mean squared error.
DOI: 10.48550/arxiv.2111.14552
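The distinction the abstract draws can be made concrete with a small numerical sketch. The toy one-step environment, the reward means, and the deficit-based sampler below are all illustrative assumptions and not the paper's Robust On-Policy Sampling algorithm; the sampler is only a simplified stand-in for the idea that a non-i.i.d. sampling scheme whose empirical action frequencies track the target policy's probabilities can lower the mean squared error of the Monte Carlo value estimate relative to i.i.d. on-policy sampling.

```python
import numpy as np

# Toy one-step environment (an illustrative assumption, not from the paper):
# K actions, each with a Gaussian reward; the target policy pi is a fixed
# distribution over actions.  The true policy value is sum_a pi(a) * mu(a).
rng = np.random.default_rng(0)
K = 4
pi = np.array([0.1, 0.2, 0.3, 0.4])        # target policy
mu = np.array([1.0, -0.5, 2.0, 0.5])       # mean reward of each action
sigma = 1.0                                 # reward noise std
true_value = float(pi @ mu)


def reward(action, rng):
    """Sample a noisy reward for the chosen action."""
    return mu[action] + sigma * rng.normal()


def on_policy_estimate(n, rng):
    """Plain Monte Carlo: draw actions i.i.d. from pi and average rewards."""
    actions = rng.choice(K, size=n, p=pi)
    return float(np.mean([reward(a, rng) for a in actions]))


def distribution_matching_estimate(n, rng):
    """Non-i.i.d. sampling sketch: at each step pick the action whose
    empirical frequency lags its target probability the most, so the
    realized action distribution tracks pi much faster than i.i.d. draws.
    (A simplified stand-in for the paper's Robust On-Policy Sampling.)"""
    counts = np.zeros(K)
    total = 0.0
    for t in range(1, n + 1):
        deficit = t * pi - counts          # how far each action trails its target count
        a = int(np.argmax(deficit))
        counts[a] += 1
        total += reward(a, rng)
    return total / n


def mse(estimator, n, trials, rng):
    """Mean squared error of an estimator of the true policy value."""
    errs = [estimator(n, rng) - true_value for _ in range(trials)]
    return float(np.mean(np.square(errs)))


if __name__ == "__main__":
    n, trials = 100, 2000
    print("true value:               ", round(true_value, 3))
    print("on-policy sampling MSE:   ", round(mse(on_policy_estimate, n, trials, rng), 4))
    print("distribution-matching MSE:", round(mse(distribution_matching_estimate, n, trials, rng), 4))
```

Under these assumptions the deficit-based sampler removes the estimation error caused by the empirical action frequencies deviating from pi, leaving only reward noise, so its MSE is lower at the same sample size; this mirrors the abstract's claim that faster convergence to the expected on-policy distribution improves Monte Carlo policy evaluation.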