Gradient compensation traces based temporal difference learning

Bibliographic Details
Published in: Neurocomputing (Amsterdam), Vol. 442, pp. 221-235
Main Authors: Bi Wang, Xuelian Li, Zhiqiang Gao, Yang Chen
Format: Journal Article
Language: English
Published: Elsevier B.V., 28.06.2021
Summary:
• An implementation of eligibility traces, named gradient compensation traces (GCT), is proposed.
• GCT maintains long-term credit assignment without the decay imposed by λ.
• The convergence of GCT-based TD (named GCTD) is established theoretically.
• The data efficiency of GCTD is empirically improved in the sparse-reward setting.

For online updates and data efficiency, eligibility traces transform forward-view algorithms, such as temporal difference learning (TD) and its control variants, into backward views. Existing research on eligibility traces, such as TD(λ) and true-online TD(λ), focuses mainly on the equivalence between forward and backward views. However, the choice of λ determines the time scope of credit assignment, and a small λ accelerates the decay of credit over time. This paper proposes a different implementation of the backward view, named gradient compensation traces (GCT). GCT compensates online for the difference between a bootstrapped gradient estimate and the true gradient, removing the extra decay of credit. Based on GCT, the corresponding temporal difference algorithm (gradient compensation TD, GCTD) is proven to converge under certain conditions. The sensitivity of GCTD's hyper-parameters is analyzed on a nonlinear long-corridor task and a linear random-walk task. The proposed algorithm is comparable with true-online TD(λ) on the basic Mountain Car task and outperforms the baselines in the sparse-reward setting.
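The decay the abstract criticises is easiest to see against the textbook algorithm GCT departs from. Below is a minimal sketch of standard semi-gradient TD(λ) with accumulating eligibility traces and linear value approximation, not the paper's GCTD update; the env and features interfaces are hypothetical placeholders for a fixed-policy prediction setting. The line z = γλz + x is where a small λ geometrically erases credit for states visited many steps ago.

    # Sketch of textbook semi-gradient TD(lambda) with accumulating traces
    # and linear value approximation v(s) = w . features(s). This is NOT the
    # paper's GCTD; it only illustrates the lambda-driven credit decay that
    # gradient compensation traces are designed to remove.
    import numpy as np

    def td_lambda_episode(env, features, w, alpha=0.1, gamma=0.99, lam=0.9):
        """Run one episode of TD(lambda) prediction; return updated weights.

        env and features are hypothetical: env.reset() -> state,
        env.step() -> (next_state, reward, done) under a fixed policy,
        features(state) -> feature vector of the same shape as w.
        """
        z = np.zeros_like(w)                  # eligibility trace vector
        s = env.reset()
        done = False
        while not done:
            s_next, r, done = env.step()
            x = features(s)
            # TD error: bootstrapped one-step target minus current estimate
            v_next = 0.0 if done else gamma * np.dot(w, features(s_next))
            delta = r + v_next - np.dot(w, x)
            # Accumulating trace: every past gradient is shrunk by
            # gamma * lam per step, so a small lam cuts off long-term
            # credit assignment -- the decay the paper's GCT addresses.
            z = gamma * lam * z + x           # grad of w.x w.r.t. w is x
            w = w + alpha * delta * z
            s = s_next
        return w

In this standard form, credit assigned to a feature visited k steps ago is scaled by (γλ)^k; the abstract's claim is that GCT avoids tying the credit horizon to λ by instead compensating the gap between the bootstrapped and true gradients online.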
ISSN: 0925-2312
eISSN: 1872-8286
DOI: 10.1016/j.neucom.2021.02.042