Improved Soft Actor-Critic: Mixing Prioritized Off-Policy Samples With On-Policy Experiences

Bibliographic Details
Published in: IEEE Transactions on Neural Networks and Learning Systems, Vol. PP; no. 3; pp. 1-9
Main Authors: Banerjee, Chayan; Chen, Zhiyong; Noman, Nasimul
Format: Journal Article
Language: English
Published: United States: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.03.2024

Summary: Soft actor-critic (SAC) is an off-policy actor-critic (AC) reinforcement learning (RL) algorithm, essentially based on entropy regularization. SAC trains a policy by maximizing the trade-off between expected return and entropy (randomness in the policy). It has achieved state-of-the-art performance on a range of continuous control benchmark tasks, outperforming prior on-policy and off-policy methods. SAC works in an off-policy fashion, where data are sampled uniformly from past experiences stored in a buffer and used to update the parameters of the policy and value function networks. We propose several crucial modifications that boost the performance of SAC and make it more sample efficient. In our proposed improved SAC (ISAC), we first introduce a new prioritization scheme for selecting better samples from the experience replay (ER) buffer. Second, we train the policy and value function networks on a mixture of the prioritized off-policy data and the latest on-policy data. We compare our approach with vanilla SAC and some recent SAC variants and show that it outperforms these benchmarks, while being comparatively more stable and sample efficient on a number of continuous control tasks in MuJoCo environments.
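The summary describes two mechanisms: priority-based selection of samples from the ER buffer and mixing those samples with the latest on-policy transitions in each training batch. The sketch below illustrates only the batch-mixing idea in generic Python; the class, the fixed on_policy_fraction, and the uniform priority weights are hypothetical placeholders and do not reproduce the authors' ISAC prioritization scheme.

import random

class MixedReplayBuffer:
    """Replay buffer mixing priority-weighted past samples with recent transitions."""

    def __init__(self, capacity=100_000, recent_window=256):
        self.capacity = capacity
        self.recent_window = recent_window
        self.storage = []   # list of (transition, priority) pairs
        self.latest = []    # most recent on-policy transitions

    def add(self, transition, priority=1.0):
        # drop the oldest entry once the buffer is full
        if len(self.storage) >= self.capacity:
            self.storage.pop(0)
        self.storage.append((transition, priority))
        # keep only the newest transitions as the on-policy pool
        self.latest = (self.latest + [transition])[-self.recent_window:]

    def sample(self, batch_size, on_policy_fraction=0.25):
        # split the batch between recent on-policy data and prioritized replay
        n_on = min(int(batch_size * on_policy_fraction), len(self.latest))
        n_off = batch_size - n_on
        transitions, priorities = zip(*self.storage)
        off_policy = random.choices(transitions, weights=priorities, k=n_off)
        on_policy = random.sample(self.latest, n_on)
        return list(off_policy) + on_policy

In an actual SAC training loop, the returned batch would feed the usual policy and value-function updates; the priority rule and the off-policy/on-policy mixing ratio are precisely the components ISAC designs, which this generic sketch does not model.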
ISSN: 2162-237X, 2162-2388
DOI: 10.1109/TNNLS.2022.3174051