Hindsight-DICE: Stable Credit Assignment for Deep Reinforcement Learning
Main Authors | , , |
---|---|
Format | Journal Article |
Language | English |
Published | 21.07.2023 |
Summary: | Oftentimes, environments for sequential decision-making problems can be quite
sparse in the provision of evaluative feedback to guide reinforcement-learning
agents. In the extreme case, long trajectories of behavior are merely
punctuated with a single terminal feedback signal, leading to a significant
temporal delay between the observation of a non-trivial reward and the
individual steps of behavior culpable for achieving said reward. Coping with
such a credit assignment challenge is one of the hallmark characteristics of
reinforcement learning. While prior work has introduced the concept of
hindsight policies to develop a theoretically motivated method for reweighting
on-policy data by impact on achieving the observed trajectory return, we show
that these methods experience instabilities which lead to inefficient learning
in complex environments. In this work, we adapt existing importance-sampling
ratio estimation techniques for off-policy evaluation to drastically improve
the stability and efficiency of these so-called hindsight policy methods. Our
hindsight distribution correction facilitates stable, efficient learning across
a broad range of environments where credit assignment plagues baseline methods. |
---|---|
DOI: | 10.48550/arxiv.2307.11897 |