Motion-Driven Spatial and Temporal Adaptive High-Resolution Graph Convolutional Networks for Skeleton-Based Action Recognition

Bibliographic Details
Published in: IEEE Transactions on Circuits and Systems for Video Technology, Vol. 33, no. 4, pp. 1868-1883
Main Authors: Huang, Zengxi; Qin, Yusong; Lin, Xiaobing; Liu, Tianlin; Feng, Zhenhua; Liu, Yiguang
Format: Journal Article
Language: English
Published: New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.04.2023
Summary: Graph convolutional networks (GCNs) have attracted increasing interest in action recognition in recent years. A GCN models human skeleton sequences as spatio-temporal graphs. Attention mechanisms are often used jointly with GCNs to highlight important frames or body joints in a sequence. However, attention modules learn their parameters offline and keep them fixed, so they may not adapt well to unseen samples. In this paper, we propose a simple but effective motion-driven spatial and temporal adaptation strategy that dynamically strengthens the features of important frames and joints for skeleton-based action recognition. The rationale is that joints and frames with dramatic motion are generally more informative and discriminative. We combine the spatial and temporal refinements in a two-branch structure, in which the joint-wise and frame-wise feature refinements are performed in parallel. Such a structure leads to learning more complementary feature representations. Moreover, we propose to use fully connected graph convolution to learn long-range spatial dependencies. In addition, we investigate two high-resolution skeleton graphs created by inserting virtual joints, aiming to improve the representation of skeleton features. Combining the above proposals, we develop a novel motion-driven spatial and temporal adaptive high-resolution GCN. Experimental results demonstrate that the proposed model achieves state-of-the-art (SOTA) results on the challenging large-scale Kinetics-Skeleton and UAV-Human datasets, and is on par with SOTA methods on the NTU-RGB+D 60 and 120 datasets. Additionally, our motion-driven adaptation method shows encouraging performance compared with attention mechanisms.
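
The abstract gives no implementation details, but the motion-driven adaptation it describes can be illustrated with a minimal PyTorch sketch. The sketch below assumes motion is measured as frame-to-frame displacement of the raw joint coordinates, and that the two parallel branches reweight frames and joints before being fused by summation; the softmax normalization and the additive fusion are illustrative assumptions, not the paper's confirmed design.

import torch
import torch.nn.functional as F

def motion_magnitude(coords):
    # coords: (N, 3, T, V) raw joint coordinates over T frames and V joints.
    # Motion is taken as the L2 norm of the frame-to-frame displacement.
    disp = coords[:, :, 1:, :] - coords[:, :, :-1, :]   # (N, 3, T-1, V)
    mag = disp.norm(dim=1)                              # (N, T-1, V)
    return F.pad(mag, (0, 0, 1, 0))                     # zero-pad the first frame -> (N, T, V)

def motion_driven_adaptation(feat, coords):
    # feat: (N, C, T, V) features produced by a GCN block.
    mag = motion_magnitude(coords)                      # (N, T, V)
    frame_w = torch.softmax(mag.sum(dim=2), dim=1)      # temporal branch weights: (N, T)
    joint_w = torch.softmax(mag.sum(dim=1), dim=1)      # spatial branch weights:  (N, V)
    temporal = feat * frame_w[:, None, :, None]         # strengthen high-motion frames
    spatial = feat * joint_w[:, None, None, :]          # strengthen high-motion joints
    return temporal + spatial                           # parallel branches fused by summation (assumption)

For example, with feat = torch.randn(8, 64, 50, 25) and coords = torch.randn(8, 3, 50, 25) (25 joints, as in NTU-RGB+D), the call returns a tensor of the same shape whose frames and joints are reweighted by their motion, computed directly from the input sample rather than from fixed attention parameters.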
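
The fully connected graph convolution mentioned for long-range spatial dependencies can likewise be sketched as a graph convolution over a learnable dense adjacency instead of the physical skeleton topology. The row-softmax normalization and the small random initialization below are assumptions for illustration.

import torch
import torch.nn as nn

class FullyConnectedGraphConv(nn.Module):
    # A graph convolution in which every joint is connected to every other
    # joint through a learnable dense adjacency, so the layer can model
    # long-range spatial dependencies beyond the physical skeleton edges.
    def __init__(self, in_channels, out_channels, num_joints):
        super().__init__()
        self.adj = nn.Parameter(torch.randn(num_joints, num_joints) * 0.01)
        self.proj = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        # x: (N, C, T, V); aggregate features from all joints, then project channels.
        a = torch.softmax(self.adj, dim=1)              # row-normalized dense adjacency
        agg = torch.einsum('nctv,wv->nctw', x, a)       # every output joint sees every input joint
        return self.proj(agg)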
ISSN: 1051-8215, 1558-2205
DOI: 10.1109/TCSVT.2022.3217763