CST-RL: Contrastive Spatio-Temporal Representations for Reinforcement Learning

Bibliographic Details
Published in	IEEE Access, Vol. 11, p. 1
Main Authors	Ho, Chi-Kai; King, Chung-Ta
Format	Journal Article
Language	English
Published	Piscataway: IEEE, 01.01.2023; The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subjects	3D CNNs; Artificial neural networks; contrastive learning; Control tasks; Correlation; Feature extraction; Machine learning; Pixels; Reinforcement learning; Representation learning; Representations; sample efficiency; Spatial data; spatio-temporal representation learning; Spatiotemporal phenomena; Standard scores; Task analysis; Three-dimensional displays; Training

More Information
Summary:	Learning representations from high-dimensional observations is critical for training pixel-based continuous control tasks with reinforcement learning (RL). Without proper representations, training directly from low-level pixel observations is very inefficient, requiring long training times and large amounts of training data. Moreover, much of the information in such observations may be redundant or irrelevant. A common approach to this problem is to train auxiliary objectives alongside the main RL objective. The additional objectives provide more learning signals to the model and reduce training time, resulting in better sample efficiency. A representative work is Contrastive Unsupervised Representations for Reinforcement Learning (CURL), which leverages contrastive learning to help RL learn useful representations. Although CURL performs very well at extracting spatial information from pixel inputs, it is found to overlook potential temporal signals. This paper introduces CST-RL, a contrastive spatio-temporal representation learning framework for RL that combines a 3D Convolutional Neural Network (3D CNN) with contrastive learning for sample-efficient RL. It attends to both spatial and temporal signals in pixel observations. Experiments on DMControl show that CST-RL outperforms CURL in all six environments after 500K environment steps and, in the majority of cases, needs only half the steps to achieve the standard score.
ISSN:	2169-3536
DOI:	10.1109/ACCESS.2023.3258540
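
The summary names two ingredients: a 3D convolution over stacked frames and a contrastive objective. The sketch below illustrates both in plain Python and is not the authors' implementation. The function names `conv3d_valid` and `info_nce_loss` are hypothetical, the single-channel nested-list layout is an assumption made for readability, and the dot-product similarity is a simplification (this record does not specify the similarity function or the network architecture used in the paper).

```python
import math


def conv3d_valid(clip, kernel):
    """Naive single-channel 3D cross-correlation with 'valid' padding.

    clip:   nested list indexed as [time][height][width]
    kernel: nested list indexed as [kt][kh][kw]
    Because the kernel also spans the time axis, each output value mixes
    information from several consecutive frames, not just one image.
    """
    T, H, W = len(clip), len(clip[0]), len(clip[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for i in range(H - kh + 1):
            row = []
            for j in range(W - kw + 1):
                s = 0.0
                for dt in range(kt):
                    for di in range(kh):
                        for dj in range(kw):
                            s += clip[t + dt][i + di][j + dj] * kernel[dt][di][dj]
                row.append(s)
            plane.append(row)
        out.append(plane)
    return out


def info_nce_loss(queries, keys, temperature=1.0):
    """Mean InfoNCE loss over a batch of embedding vectors.

    keys[i] is treated as the positive for queries[i]; every other key in
    the batch acts as a negative. Similarity is a plain dot product
    (an assumption made for this sketch).
    """
    total = 0.0
    for i, q in enumerate(queries):
        logits = [sum(a * b for a, b in zip(q, k)) / temperature for k in keys]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]  # cross-entropy with the positive as target
    return total / len(queries)
```

With a kernel of shape 2x1x1 holding the values -1 and 1, `conv3d_valid` reduces to a frame-to-frame difference, which is one simple way to see how a temporal kernel axis exposes motion that a per-frame 2D convolution cannot. The loss is lower when each query is most similar to its own key than when the pairing is shuffled.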