Learning Adversarial MDPs with Stochastic Hard Constraints
Main Authors | , , , |
---|---|
Format | Journal Article |
Language | English |
Published | 06.03.2024 |
Subjects | |
DOI | 10.48550/arxiv.2403.03672 |
Summary: We study online learning in constrained Markov decision processes (CMDPs)
with adversarial losses and stochastic hard constraints, under bandit feedback.
We consider three scenarios. In the first one, we address general CMDPs, where
we design an algorithm attaining sublinear regret and cumulative positive
constraints violation. In the second scenario, under the mild assumption that a
policy strictly satisfying the constraints exists and is known to the learner,
we design an algorithm that achieves sublinear regret while ensuring that
constraints are satisfied at every episode with high probability. In the last
scenario, we only assume the existence of a strictly feasible policy, which is
not known to the learner, and we design an algorithm attaining sublinear regret
and constant cumulative positive constraints violation. Finally, we show that
in the last two scenarios, a dependence on Slater's parameter is
unavoidable. To the best of our knowledge, our work is the first to study CMDPs
involving both adversarial losses and hard constraints. Thus, our algorithms
can deal with general non-stationary environments subject to requirements much
stricter than those manageable with existing ones, enabling their adoption in a
much wider range of applications.
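For readers unfamiliar with the performance measures named in the summary, the sketch below spells out how regret and cumulative positive constraints violation are commonly defined in the CMDP literature. The notation ($\pi_t$, $\ell_t$, $g_i$, $\Pi_{\text{safe}}$, $T$, $m$) is an illustrative assumption and is not taken from this record.

```latex
% Illustrative definitions (assumed notation, not quoted from the paper):
%   \pi_t  : policy played in episode t,      T : number of episodes,
%   \ell_t : adversarial loss in episode t,   g_i : i-th stochastic constraint (in expectation),
%   \Pi_{\text{safe}} : set of policies satisfying all constraints.
\[
  R_T \;=\; \sum_{t=1}^{T} \ell_t(\pi_t) \;-\; \min_{\pi \in \Pi_{\text{safe}}} \sum_{t=1}^{T} \ell_t(\pi),
  \qquad
  V_T \;=\; \max_{i \in [m]} \sum_{t=1}^{T} \big[\, g_i(\pi_t) \,\big]^{+}.
\]
```

Under these (assumed) definitions, sublinear regret means $R_T = o(T)$, the constant violation guarantee of the third scenario reads $V_T = O(1)$ in $T$, and Slater's parameter is the margin by which the strictly feasible policy satisfies every constraint.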