Distributed Off-Policy Temporal Difference Learning Using Primal-Dual Method

Bibliographic Details
Published in	IEEE Access, Vol. 10, p. 1
Main Authors	Lee, Donghwan; Kim, Do Wan; Hu, Jianghai
Format	Journal Article
Language	English
Published	Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2022
Subjects	Algorithms; Communication networks; Convergence; distributed optimization; Distributed processing; Linear programming; Machine learning; Markov decision process; Markov processes; Multi-agent systems; Multiagent systems; Optimal control; Optimization; primal-dual method; Reinforcement learning; Reinforcement learning (RL); Saddle points; saddle-point method; Sequential analysis; sequential decision problem; Symmetric matrices; temporal difference (TD) learning; temporal difference learning

Summary:	The goal of this paper is to provide theoretical analysis of, and additional insights into, a distributed temporal-difference (TD) learning algorithm for multi-agent Markov decision processes (MDPs) from a saddle-point viewpoint. Single-agent TD learning is a reinforcement learning (RL) algorithm for evaluating a given policy based on reward feedback. In multi-agent settings, multiple RL agents act concurrently, and each agent receives its own local rewards. The goal of each agent is to evaluate a given policy with respect to the global reward, which is the average of the local rewards, by sharing learning parameters through random network communications. In this paper, we propose a distributed TD-learning algorithm based on a saddle-point framework, and provide a rigorous finite-time convergence analysis of the algorithm and its solution using tools from optimization theory. The results provide a general and unified perspective on the distributed policy evaluation problem and complement previous work theoretically.
ISSN:	2169-3536
DOI:	10.1109/ACCESS.2022.3211395
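
Illustration:	The distributed policy evaluation setting described in the summary can be made concrete with a small sketch. This is not the primal-dual algorithm of the paper; it is a simpler consensus-style variant, written under several assumptions that do not come from the record: a two-state Markov chain, a tabular value function, noise-free (expected) TD(0) updates in place of sampled ones, and uniform averaging over all agents at every step in place of random network communication. All names and numbers (`P`, `LOCAL_REWARDS`, `ALPHA`, `GAMMA`, the function names) are hypothetical.

```python
GAMMA = 0.9   # discount factor (assumed)
ALPHA = 0.5   # step size (assumed)

# Transition matrix of the 2-state chain induced by the fixed policy (assumed).
P = [[0.9, 0.1],
     [0.2, 0.8]]

# One row of per-state local rewards per agent (assumed).
# The global reward is their average: [1.0, 0.0].
LOCAL_REWARDS = [[3.0, 0.0],
                 [0.0, 0.0],
                 [0.0, 0.0]]


def expected_td_step(v, r):
    """Noise-free TD(0) step: v + ALPHA * (r + GAMMA * P v - v)."""
    n = len(v)
    return [
        v[s] + ALPHA * (r[s] + GAMMA * sum(P[s][t] * v[t] for t in range(n)) - v[s])
        for s in range(n)
    ]


def average(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors[0])
    return [sum(vec[s] for vec in vectors) / len(vectors) for s in range(n)]


def distributed_td(num_iters=2000):
    """Each agent takes a local TD step, then all agents average their values."""
    values = [[0.0, 0.0] for _ in LOCAL_REWARDS]
    for _ in range(num_iters):
        # Local step: each agent only uses its own reward vector.
        updated = [expected_td_step(v, r) for v, r in zip(values, LOCAL_REWARDS)]
        # Communication step: uniform averaging (complete graph, assumed).
        consensus = average(updated)
        values = [list(consensus) for _ in updated]
    return values


def exact_value():
    """Solve (I - GAMMA * P) v = r_bar for the 2-state chain by Cramer's rule."""
    r_bar = average(LOCAL_REWARDS)
    a, b = 1 - GAMMA * P[0][0], -GAMMA * P[0][1]
    c, d = -GAMMA * P[1][0], 1 - GAMMA * P[1][1]
    det = a * d - b * c
    return [(r_bar[0] * d - b * r_bar[1]) / det,
            (a * r_bar[1] - c * r_bar[0]) / det]


if __name__ == "__main__":
    for i, v in enumerate(distributed_td()):
        print(f"agent {i}: {v}")
    print("value of the averaged reward:", exact_value())
```

In this sketch, agents that never receive a reward themselves still arrive at the value function of the averaged reward, because the averaging step mixes the information carried by every agent's local update. With sampled transitions or a sparse, randomly changing communication graph, agreement would in general hold only approximately or asymptotically.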