Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model
Published in | Machine learning, Vol. 91, no. 3, pp. 325-349 |
---|---|
Main Authors | , , |
Format | Journal Article |
Language | English |
Published | Boston: Springer US, 01.06.2013 |
Summary: We consider the problems of learning the optimal action-value function and the optimal policy in discounted-reward Markov decision processes (MDPs). We prove new PAC bounds on the sample complexity of two well-known model-based reinforcement learning (RL) algorithms in the presence of a generative model of the MDP: value iteration and policy iteration. The first result indicates that for an MDP with N state-action pairs and discount factor γ ∈ [0, 1), only O(N log(N/δ)/((1−γ)³ε²)) state-transition samples are required to find an ε-optimal estimation of the action-value function with probability (w.p.) 1−δ. Further, we prove that, for small values of ε, an order of O(N log(N/δ)/((1−γ)³ε²)) samples is required to find an ε-optimal policy w.p. 1−δ. We also prove a matching lower bound of Θ(N log(N/δ)/((1−γ)³ε²)) on the sample complexity of estimating the optimal action-value function with ε accuracy. To the best of our knowledge, this is the first minimax result on the sample complexity of RL: the upper bounds match the lower bound in terms of N, ε, δ and 1/(1−γ) up to a constant factor. Also, both our lower bound and upper bound improve on the state of the art in terms of their dependence on 1/(1−γ).
ISSN: 0885-6125, 1573-0565
DOI: 10.1007/s10994-013-5368-1
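The abstract concerns model-based value iteration run on an empirical MDP built from a generative model. Below is a minimal sketch, under stated assumptions, of that style of algorithm and of how the quoted O(N log(N/δ)/((1−γ)³ε²)) total-sample bound translates into a per-pair sample size; the function name, the `sample_next_state(s, a)` interface, and the constant `c` are illustrative assumptions, not the paper's exact pseudocode or constants.

```python
import numpy as np

def model_based_qvi(sample_next_state, S, A, reward, gamma, eps, delta, c=1.0):
    """Sketch of model-based Q-value iteration with a generative model.

    sample_next_state(s, a) -- draws one next state from P(.|s, a) (the generative model)
    reward                  -- known (S, A) array of rewards, assumed in [0, 1]
    c                       -- unspecified universal constant in the bound (illustrative)
    """
    N = S * A
    # Per-(s, a) sample size suggested by the O(N log(N/delta) / ((1 - gamma)^3 eps^2)) total bound.
    n = int(np.ceil(c * np.log(N / delta) / ((1.0 - gamma) ** 3 * eps ** 2)))

    # Empirical transition model built from n i.i.d. next-state samples per state-action pair.
    P_hat = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            for _ in range(n):
                P_hat[s, a, sample_next_state(s, a)] += 1.0
    P_hat /= n

    # Value iteration on the empirical MDP, run long enough that the iteration
    # error is negligible relative to eps.
    iters = max(1, int(np.ceil(np.log(1.0 / (eps * (1.0 - gamma))) / (1.0 - gamma))))
    Q = np.zeros((S, A))
    for _ in range(iters):
        V = Q.max(axis=1)                 # greedy state values under the current Q
        Q = reward + gamma * (P_hat @ V)  # empirical Bellman optimality backup
    return Q
```

For a small synthetic MDP this can be exercised by passing a sampler that draws from a known transition matrix; the greedy policy `Q.argmax(axis=1)` is then the candidate ε-optimal policy referred to in the second result of the abstract.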