Reinforcement Learning for Partially Observable Dynamic Processes: Adaptive Dynamic Programming Using Measured Output Data

Bibliographic Details
Published in	IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), Vol. 41, no. 1, pp. 14-25
Main Authors	Lewis, F. L.; Vamvoudakis, K. G.
Format	Journal Article
Language	English
Published	United States IEEE 01.02.2011
Subjects	Algorithms; Approximate dynamic programming (ADP); Artificial Intelligence; Control systems; data-based optimal control; Dynamic programming; Dynamical systems; Dynamics; Equivalence; Feedback; Feedback control; Learning; Markov Chains; Optimal control; Output feedback; output feedback (OPFB); policy iteration (PI); Polynomials; Reinforcement; Reinforcement (Psychology); State feedback; Stochastic systems; Upper bound; value iteration (VI)
Summary:	Approximate dynamic programming (ADP) is a class of reinforcement learning methods that have shown their importance in a variety of applications, including feedback control of dynamical systems. ADP generally requires full information about the internal states of the system, which is usually not available in practical situations. In this paper, we show how to implement ADP methods using only measured input/output data from the system. We consider linear dynamical systems with deterministic behavior, which are of great interest in the control systems community. In control system theory, methods of this type are referred to as output feedback (OPFB). The stochastic equivalent of the systems dealt with in this paper is a class of partially observable Markov decision processes. We develop both policy iteration and value iteration algorithms that converge to an optimal controller that requires only OPFB. It is shown that, as with Q-learning, the new methods have the important advantage that knowledge of the system dynamics is not needed to implement these learning algorithms or the OPFB control. Only the order of the system and an upper bound on its "observability index" must be known. The learned OPFB controller takes the form of a polynomial autoregressive moving-average controller whose performance is equivalent to that of the optimal state-variable feedback gain.
ISSN:	1083-4419 1941-0492
DOI:	10.1109/TSMCB.2010.2043839
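
Illustration:	The Summary states that only input/output data, the system order, and an upper bound on the observability index are needed. One way to see why this can work is that, for an observable deterministic linear system, the current state is an exact linear function of a sufficiently long window of past inputs and outputs. The Python sketch below checks this numerically. It is not the algorithm from the paper: the matrices A, B, C, the window length N, and all function names are hypothetical choices made for illustration, and the least-squares fit uses the true state only to verify the relation.

```python
import numpy as np

def simulate(A, B, C, u, x0):
    """Simulate x[k+1] = A x[k] + B u[k], y[k] = C x[k]; return states and outputs."""
    n_steps = u.shape[0]
    x = np.zeros((n_steps + 1, A.shape[0]))
    y = np.zeros((n_steps, C.shape[0]))
    x[0] = x0
    for k in range(n_steps):
        y[k] = C @ x[k]
        x[k + 1] = A @ x[k] + B @ u[k]
    return x, y

def io_history(u, y, k, N):
    """Stack the N inputs and N outputs just before time k, newest first."""
    past_u = u[k - N:k][::-1].ravel()
    past_y = y[k - N:k][::-1].ravel()
    return np.concatenate([past_u, past_y])

def fit_state_from_io(A, B, C, N, n_steps=200, seed=0):
    """Fit x[k] ~ M^T z[k], with z[k] the I/O history, and report the worst error."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal((n_steps, B.shape[1]))  # random exciting input
    x0 = rng.standard_normal(A.shape[0])
    x, y = simulate(A, B, C, u, x0)
    Z = np.array([io_history(u, y, k, N) for k in range(N, n_steps)])
    X = x[N:n_steps]  # true states, used here only to check the fit
    M, *_ = np.linalg.lstsq(Z, X, rcond=None)
    max_error = np.max(np.abs(Z @ M - X))
    return M, max_error

# Hypothetical second-order, single-input, single-output example.
A = np.array([[1.1, -0.3], [1.0, 0.0]])
B = np.array([[1.0], [0.0]])
C = np.array([[1.0, -0.8]])

# N = 2 equals the system order, which bounds the observability index.
M, max_error = fit_state_from_io(A, B, C, N=2)
print("max reconstruction error with N=2:", max_error)

# A window that is too short (N = 1) generally cannot reproduce the state.
_, short_error = fit_state_from_io(A, B, C, N=1)
print("max reconstruction error with N=1:", short_error)
```

With N = 2 the fit should be exact up to floating-point error, while N = 1 leaves a visible residual. This is consistent with the requirement in the Summary that an upper bound on the observability index be known.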