Upper Counterfactual Confidence Bounds: a New Optimism Principle for Contextual Bandits
Format | Journal Article
Language | English
Published | 15.07.2020
Summary: | The principle of optimism in the face of uncertainty is one of the most
widely used and successful ideas in multi-armed bandits and reinforcement
learning. However, existing optimistic algorithms (primarily UCB and its
variants) often struggle to deal with general function classes and large
context spaces. In this paper, we study general contextual bandits with an
offline regression oracle and propose a simple, generic principle for designing
optimistic algorithms, dubbed "Upper Counterfactual Confidence Bounds" (UCCB).
The key innovation of UCCB is building confidence bounds in policy space,
rather than in action space as is done in UCB. We demonstrate that these
algorithms are provably optimal and computationally efficient in handling
general function classes and large context spaces. Furthermore, we illustrate
that the UCCB principle can be seamlessly extended to infinite-action general
contextual bandits, providing the first solutions to these settings when an
offline regression oracle is employed. |
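For context on the contrast the abstract draws, the classical UCB rule builds a confidence bound per action: each arm's empirical mean is inflated by an exploration bonus that shrinks with its pull count. The sketch below shows that standard action-space rule only; it is a generic UCB1 illustration, not the UCCB algorithm of the paper, whose policy-space bounds are not specified in this record.

```python
import math
import random

def ucb1_select(counts, means, t):
    """Classical UCB1: pick the arm maximizing mean + sqrt(2 ln t / n_a).

    The confidence bound is built per action -- the design choice that
    UCCB replaces with confidence bounds in policy space.
    """
    for a, n in enumerate(counts):
        if n == 0:
            return a  # play each arm once before the bounds are defined
    scores = [m + math.sqrt(2.0 * math.log(t) / n)
              for m, n in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)

def run_ucb1(true_means, horizon, seed=0):
    """Run UCB1 on Bernoulli arms; return how often each arm was pulled."""
    rng = random.Random(seed)
    k = len(true_means)
    counts, means = [0] * k, [0.0] * k
    for t in range(1, horizon + 1):
        a = ucb1_select(counts, means, t)
        reward = 1.0 if rng.random() < true_means[a] else 0.0
        counts[a] += 1
        means[a] += (reward - means[a]) / counts[a]  # incremental mean
    return counts
```

Over a modest horizon, the bonus term concentrates play on the better arm, e.g. `run_ucb1([0.2, 0.8], 2000)` pulls arm 1 far more often than arm 0.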
DOI: | 10.48550/arxiv.2007.07876 |