Skeleton Action Recognition Based on Temporal Gated Unit and Adaptive Graph Convolution

Bibliographic Details
Published in	Electronics (Basel) Vol. 11; no. 18; p. 2973
Main Authors	Zhu, Qilin; Deng, Hongmin; Wang, Kaixuan
Format	Journal Article
Language	English
Published	Basel: MDPI AG, 01.09.2022
Subjects	Accuracy; Artificial intelligence; Convolution; Euclidean space; Feature extraction; Fourier transforms; Human acts; Human behavior; Identification and classification; Machine vision; Mathematical models; Methods; Modules; Neural networks; Parameters; Recognition; Spatial data; China
Summary:	In recent years, great progress has been made in skeleton-based action recognition using graph convolutional networks (GCNs). Most existing methods, however, use a fixed adjacency matrix and a fixed graph structure to extract spatial features from skeleton data, which usually leads to weak spatial modeling ability, unsatisfactory generalization performance, and an excessive number of model parameters. In the temporal dimension, most of these methods follow the ST-GCN approach, which inevitably processes a number of non-key frames, increasing the computational cost of feature extraction and slowing the model down. In this paper, a gated temporally and spatially adaptive graph convolutional network is proposed. On the one hand, a learnable parameter matrix that adaptively learns the key information of the skeleton data in the spatial dimension is added to the graph convolution layer, improving the feature extraction and generalizability of the model and reducing the number of parameters. On the other hand, a gated unit is added to the temporal feature extraction module to alleviate interference from non-critical frames and reduce computational complexity. A channel attention mechanism based on an SE module and a frame attention mechanism are used to enhance the model’s feature extraction ability. To prevent model degradation and ensure more stable training, residual links are added to each feature extraction module. The proposed approach achieves 0.63% higher accuracy on the X-Sub benchmark with 4.46 M fewer parameters than GAT, one of the best SOTA methods. The inference speed of the model reaches 86.23 sequences/(second × GPU). Extensive experimental results further validate the effectiveness of the proposed approach on three large-scale datasets, namely, NTU RGB+D 60, NTU RGB+D 120, and Kinetics Skeleton.
ISSN:	2079-9292
DOI:	10.3390/electronics11182973
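
The summary names two mechanisms: a learnable matrix added to the graph convolution layer, and a gated unit in the temporal module, each wrapped in a residual link. The NumPy sketch below shows one plausible reading of that description and is not the authors' implementation. The summary gives no equations, so the additive combination of the fixed and learnable graphs, the sigmoid (GLU-style) gate, and every function and parameter name here are assumptions made for illustration. The SE channel attention and frame attention are omitted.

```python
import numpy as np


def normalize_adjacency(A):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} of a skeleton graph."""
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def adaptive_graph_conv(x, A_norm, B, W):
    """Spatial graph convolution with an assumed additive learnable matrix B.

    x: (C_in, T, V) features; A_norm, B: (V, V); W: (C_out, C_in).
    """
    # Aggregate joint features over the fixed graph plus the learned offsets.
    agg = np.einsum("ctv,vw->ctw", x, A_norm + B)
    # Mix channels with a 1x1 (pointwise) weight.
    return np.einsum("oc,ctw->otw", W, agg)


def temporal_conv(x, K):
    """'Same'-padded 1D convolution along the frame axis.

    x: (C_in, T, V); K: (C_out, C_in, k) with odd kernel size k.
    """
    k = K.shape[2]
    pad = k // 2
    T = x.shape[1]
    xp = np.pad(x, ((0, 0), (pad, pad), (0, 0)))   # zero-pad the time axis
    out = np.zeros((K.shape[0], T, x.shape[2]))
    for i in range(k):
        out += np.einsum("oc,ctv->otv", K[:, :, i], xp[:, i:i + T, :])
    return out


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gated_temporal_unit(x, K_feat, K_gate):
    """Assumed GLU-style gate: features scaled by a per-frame sigmoid gate."""
    return temporal_conv(x, K_feat) * sigmoid(temporal_conv(x, K_gate))


def block(x, A_norm, B, W, K_feat, K_gate):
    """Spatial then temporal module, each with a residual link.

    Assumes C_out == C_in so the residual additions are shape-compatible.
    """
    h = adaptive_graph_conv(x, A_norm, B, W) + x
    return gated_temporal_unit(h, K_feat, K_gate) + h
```

In this sketch, frames whose gate activation is close to zero contribute little to the temporal features, which is one way to read "alleviate interference from non-critical frames". In a trainable version, `B`, `W`, `K_feat`, and `K_gate` would be learned parameters, with `B` typically initialized to zeros so training starts from the fixed skeleton graph.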