Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model
Published in | Machine learning, Vol. 91, no. 3, pp. 325-349 |
---|---|
Main Authors | , , |
Format | Journal Article |
Language | English |
Published | Boston: Springer US, 01.06.2013 |
Summary: We consider the problems of learning the optimal action-value function and the optimal policy in discounted-reward Markov decision processes (MDPs). We prove new PAC bounds on the sample complexity of two well-known model-based reinforcement learning (RL) algorithms in the presence of a generative model of the MDP: value iteration and policy iteration. The first result indicates that for an MDP with N state-action pairs and discount factor γ ∈ [0, 1), only O(N log(N/δ)/((1−γ)³ε²)) state-transition samples are required to find an ε-optimal estimation of the action-value function with probability (w.p.) 1−δ. Further, we prove that, for small values of ε, an order of O(N log(N/δ)/((1−γ)³ε²)) samples is required to find an ε-optimal policy w.p. 1−δ. We also prove a matching lower bound of Θ(N log(N/δ)/((1−γ)³ε²)) on the sample complexity of estimating the optimal action-value function with ε accuracy. To the best of our knowledge, this is the first minimax result on the sample complexity of RL: the upper bounds match the lower bound in terms of N, ε, δ and 1/(1−γ) up to a constant factor. Also, both our lower bound and upper bound improve on the state of the art in terms of their dependence on 1/(1−γ).
ISSN: 0885-6125, 1573-0565
DOI: 10.1007/s10994-013-5368-1
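The abstract concerns model-based value iteration run on an empirical MDP built from a generative model. Below is a minimal sketch, under stated assumptions, of that style of algorithm and of how the quoted O(N log(N/δ)/((1−γ)³ε²)) total-sample bound translates into a per-pair sample size; the function name, the `sample_next_state(s, a)` interface, and the constant `c` are illustrative assumptions, not the paper's exact pseudocode or constants.

```python
import numpy as np

def model_based_qvi(sample_next_state, S, A, reward, gamma, eps, delta, c=1.0):
    """Sketch of model-based Q-value iteration with a generative model.

    sample_next_state(s, a) -- draws one next state from P(.|s, a) (the generative model)
    reward                  -- known (S, A) array of rewards, assumed in [0, 1]
    c                       -- unspecified universal constant in the bound (illustrative)
    """
    N = S * A
    # Per-(s, a) sample size suggested by the O(N log(N/delta) / ((1 - gamma)^3 eps^2)) total bound.
    n = int(np.ceil(c * np.log(N / delta) / ((1.0 - gamma) ** 3 * eps ** 2)))

    # Empirical transition model built from n i.i.d. next-state samples per state-action pair.
    P_hat = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            for _ in range(n):
                P_hat[s, a, sample_next_state(s, a)] += 1.0
    P_hat /= n

    # Value iteration on the empirical MDP, run long enough that the iteration
    # error is negligible relative to eps.
    iters = max(1, int(np.ceil(np.log(1.0 / (eps * (1.0 - gamma))) / (1.0 - gamma))))
    Q = np.zeros((S, A))
    for _ in range(iters):
        V = Q.max(axis=1)                 # greedy state values under the current Q
        Q = reward + gamma * (P_hat @ V)  # empirical Bellman optimality backup
    return Q
```

For a small synthetic MDP this can be exercised by passing a sampler that draws from a known transition matrix; the greedy policy `Q.argmax(axis=1)` is then the candidate ε-optimal policy referred to in the second result of the abstract.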