Target Networks and Over-parameterization Stabilize Off-policy Bootstrapping with Function Approximation
Main Authors | |
---|---|
Format | Journal Article |
Language | English |
Published | 31.05.2024 |
Summary: Proceedings of the 41st International Conference on Machine Learning, 2024.

We prove that the combination of a target network and over-parameterized linear function approximation establishes a weaker convergence condition for bootstrapped value estimation in certain cases, even with off-policy data. Our condition is naturally satisfied for expected updates over the entire state-action space or learning with a batch of complete trajectories from episodic Markov decision processes. Notably, using only a target network or an over-parameterized model does not provide such a convergence guarantee. Additionally, we extend our results to learning with truncated trajectories, showing that convergence is achievable for all tasks with minor modifications, akin to value truncation for the final states in trajectories. Our primary result focuses on temporal difference estimation for prediction, providing high-probability value estimation error bounds and empirical analysis on Baird's counterexample and a Four-room task. Furthermore, we explore the control setting, demonstrating that similar convergence conditions apply to Q-learning.
DOI: 10.48550/arxiv.2405.21043
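
To make the setting in the summary concrete, here is a minimal sketch (not the paper's algorithm) of linear TD(0) prediction with a periodically synchronized target network and an over-parameterized feature matrix, trained on complete trajectories from a small episodic chain MDP. The toy MDP, the random feature construction, and all hyperparameters below are illustrative assumptions, not details taken from the paper.

```python
# Sketch: semi-gradient TD(0) with a target network and over-parameterized
# linear features (more features than states). Everything here is an
# illustrative assumption; it is not the paper's construction.
import numpy as np

rng = np.random.default_rng(0)

n_states = 5          # small episodic chain MDP (assumed for illustration)
n_features = 8        # more features than states -> over-parameterized
gamma = 0.9
alpha = 0.05          # step size
sync_every = 100      # target-network synchronization period
n_updates = 20_000

# Random over-parameterized features: each state maps to an 8-dim vector.
Phi = rng.normal(size=(n_states, n_features))

def sample_trajectory():
    # From state s, move to s+1 with reward 1; the last state terminates.
    # Data are complete trajectories, matching the "batch of complete
    # trajectories" condition mentioned in the summary.
    traj, s = [], 0
    while s < n_states - 1:
        traj.append((s, 1.0, s + 1, s + 1 == n_states - 1))
        s += 1
    return traj

w = np.zeros(n_features)   # online weights
w_target = w.copy()        # target-network weights

for t in range(n_updates):
    if t % sync_every == 0:
        w_target = w.copy()                     # periodic target sync
    for s, r, s_next, done in sample_trajectory():
        v_next = 0.0 if done else Phi[s_next] @ w_target
        td_error = r + gamma * v_next - Phi[s] @ w
        w += alpha * td_error * Phi[s]          # semi-gradient TD(0) update

# Compare estimates with the true discounted returns on non-terminal states.
true_values = [sum(gamma**k for k in range(n_states - 1 - s))
               for s in range(n_states - 1)]
print("estimated values:", np.round(Phi[:-1] @ w, 3))
print("true values     :", np.round(true_values, 3))
```

In this sketch the target network freezes the bootstrap target between synchronizations, so each inner update is a plain linear regression step toward a fixed target, which is the stabilizing mechanism the summary attributes to combining target networks with over-parameterization.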