Motion-Driven Spatial and Temporal Adaptive High-Resolution Graph Convolutional Networks for Skeleton-Based Action Recognition

Bibliographic Details
Published in: IEEE Transactions on Circuits and Systems for Video Technology, Vol. 33, no. 4, pp. 1868-1883
Main Authors: Huang, Zengxi; Qin, Yusong; Lin, Xiaobing; Liu, Tianlin; Feng, Zhenhua; Liu, Yiguang
Format: Journal Article
Language: English
Published: New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.04.2023
Summary: Graph convolutional networks (GCNs) have attracted increasing interest in action recognition in recent years. A GCN models human skeleton sequences as spatio-temporal graphs. Attention mechanisms are often used jointly with GCNs to highlight important frames or body joints in a sequence. However, attention modules learn their parameters offline and keep them fixed, so they may not adapt well to unseen samples. In this paper, we propose a simple but effective motion-driven spatial and temporal adaptation strategy that dynamically strengthens the features of important frames and joints for skeleton-based action recognition. The rationale is that joints and frames with dramatic motion are generally more informative and discriminative. We combine the spatial and temporal refinements in a two-branch structure, in which the joint-wise and frame-wise feature refinements are performed in parallel. Such a structure leads to learning more complementary feature representations. Moreover, we propose to use fully connected graph convolution to learn long-range spatial dependencies. In addition, we investigate two high-resolution skeleton graphs created by inserting virtual joints, aiming to improve the representation of skeleton features. Combining the above proposals, we develop a novel motion-driven spatial and temporal adaptive high-resolution GCN. Experimental results demonstrate that the proposed model achieves state-of-the-art (SOTA) results on the challenging large-scale Kinetics-Skeleton and UAV-Human datasets, and is on par with SOTA methods on the NTU-RGB+D 60 and 120 datasets. Additionally, our motion-driven adaptation method shows encouraging performance compared with attention mechanisms.
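
The abstract gives no implementation details, but the motion-driven adaptation it describes can be illustrated with a minimal PyTorch sketch. The sketch below assumes motion is measured as frame-to-frame displacement of the raw joint coordinates, and that the two parallel branches reweight frames and joints before being fused by summation; the softmax normalization and the additive fusion are illustrative assumptions, not the paper's confirmed design.

import torch
import torch.nn.functional as F

def motion_magnitude(coords):
    # coords: (N, 3, T, V) raw joint coordinates over T frames and V joints.
    # Motion is taken as the L2 norm of the frame-to-frame displacement.
    disp = coords[:, :, 1:, :] - coords[:, :, :-1, :]   # (N, 3, T-1, V)
    mag = disp.norm(dim=1)                              # (N, T-1, V)
    return F.pad(mag, (0, 0, 1, 0))                     # zero-pad the first frame -> (N, T, V)

def motion_driven_adaptation(feat, coords):
    # feat: (N, C, T, V) features produced by a GCN block.
    mag = motion_magnitude(coords)                      # (N, T, V)
    frame_w = torch.softmax(mag.sum(dim=2), dim=1)      # temporal branch weights: (N, T)
    joint_w = torch.softmax(mag.sum(dim=1), dim=1)      # spatial branch weights:  (N, V)
    temporal = feat * frame_w[:, None, :, None]         # strengthen high-motion frames
    spatial = feat * joint_w[:, None, None, :]          # strengthen high-motion joints
    return temporal + spatial                           # parallel branches fused by summation (assumption)

For example, with feat = torch.randn(8, 64, 50, 25) and coords = torch.randn(8, 3, 50, 25) (25 joints, as in NTU-RGB+D), the call returns a tensor of the same shape whose frames and joints are reweighted by their motion, computed directly from the input sample rather than from fixed attention parameters.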
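
The fully connected graph convolution mentioned for long-range spatial dependencies can likewise be sketched as a graph convolution over a learnable dense adjacency instead of the physical skeleton topology. The row-softmax normalization and the small random initialization below are assumptions for illustration.

import torch
import torch.nn as nn

class FullyConnectedGraphConv(nn.Module):
    # A graph convolution in which every joint is connected to every other
    # joint through a learnable dense adjacency, so the layer can model
    # long-range spatial dependencies beyond the physical skeleton edges.
    def __init__(self, in_channels, out_channels, num_joints):
        super().__init__()
        self.adj = nn.Parameter(torch.randn(num_joints, num_joints) * 0.01)
        self.proj = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        # x: (N, C, T, V); aggregate features from all joints, then project channels.
        a = torch.softmax(self.adj, dim=1)              # row-normalized dense adjacency
        agg = torch.einsum('nctv,wv->nctw', x, a)       # every output joint sees every input joint
        return self.proj(agg)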
ISSN: 1051-8215, 1558-2205
DOI: 10.1109/TCSVT.2022.3217763