RE-STNet: relational enhancement spatio-temporal networks based on skeleton action recognition


Bibliographic Details
Published in: Multimedia Tools and Applications, Vol. 84, No. 8, pp. 4049-4069
Main Authors: Chen, Hongwei; He, Shiqi; Chen, Zexi
Format: Journal Article
Language: English
Published: New York: Springer US (Springer Nature B.V.), 01.03.2025

Summary: Learning comprehensive spatio-temporal joint connections in complex actions is crucial for recognizing actions from skeleton sequences. However, existing methods extract spatio-temporal features uniformly without focusing on critical joint connections, and they fail to provide effective complementary information for the acquired joint features. Additionally, using a single-level topology restricts the exploration of global node relationships, with potential loss of implicit node correlations that can affect model fusion. To address these challenges, this study introduces the Relational Enhancement Spatio-Temporal Networks (RE-STNet). RE-STNet employs a complementary relationship graph convolution method to capture crucial joint connections and the corresponding positional information within a region, while a joint cross-connection module captures the global receptive field of the current pose. Furthermore, because action sequences contain considerable uninformative content, the paper proposes a temporal incentive module that captures salient frame information and combines it with a multi-scale temporal convolution module to enrich the temporal features. The resulting RE-STNet architecture is evaluated on three skeleton datasets, achieving accuracies of 92.2% on the NTU RGB+D 60 cross-subject split, 88.6% on the NTU RGB+D 120 cross-subject split, and 95.5% on NW-UCLA. The experimental results demonstrate that the model learns more comprehensive spatio-temporal joint information.
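
Illustrative sketch: the abstract does not include implementation details, so the following is a minimal, hypothetical PyTorch-style spatio-temporal block, not the authors' code. It only illustrates the general ingredients the summary names (a spatial graph convolution over joints, a frame-level gate in the spirit of a temporal incentive module, and multi-scale temporal convolutions); all class names, shapes, and hyperparameters are assumptions.

# Minimal sketch (assumed design, not the published RE-STNet implementation).
import torch
import torch.nn as nn

class SpatialGraphConv(nn.Module):
    """Graph convolution over the joint dimension using a fixed adjacency A (V x V)."""
    def __init__(self, in_channels, out_channels, A):
        super().__init__()
        self.register_buffer("A", A)               # normalized skeleton adjacency (assumed)
        self.proj = nn.Conv2d(in_channels, out_channels, kernel_size=1)
    def forward(self, x):                          # x: (N, C, T, V)
        x = torch.einsum("nctv,vw->nctw", x, self.A)   # aggregate neighboring joints
        return self.proj(x)

class TemporalGate(nn.Module):
    """Frame-wise gate that down-weights uninformative frames (incentive-style, assumed)."""
    def __init__(self, channels):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(channels, channels // 4), nn.ReLU(),
                                nn.Linear(channels // 4, 1), nn.Sigmoid())
    def forward(self, x):                          # x: (N, C, T, V)
        s = x.mean(dim=3).permute(0, 2, 1)         # (N, T, C): per-frame descriptor
        w = self.fc(s).permute(0, 2, 1).unsqueeze(-1)  # (N, 1, T, 1) frame weights
        return x * w

class MultiScaleTemporalConv(nn.Module):
    """Parallel temporal convolutions with different dilations, summed."""
    def __init__(self, channels, dilations=(1, 2)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=(5, 1),
                      padding=(2 * d, 0), dilation=(d, 1))
            for d in dilations])
    def forward(self, x):                          # x: (N, C, T, V)
        return sum(b(x) for b in self.branches)

class STBlock(nn.Module):
    def __init__(self, in_channels, out_channels, A):
        super().__init__()
        self.spatial = SpatialGraphConv(in_channels, out_channels, A)
        self.gate = TemporalGate(out_channels)
        self.temporal = MultiScaleTemporalConv(out_channels)
        self.relu = nn.ReLU()
    def forward(self, x):
        return self.relu(self.temporal(self.gate(self.spatial(x))))

if __name__ == "__main__":
    V = 25                                         # e.g. NTU RGB+D joint count
    A = torch.eye(V)                               # placeholder adjacency for the sketch
    block = STBlock(3, 64, A)
    x = torch.randn(2, 3, 64, V)                   # (batch, channels, frames, joints)
    print(block(x).shape)                          # torch.Size([2, 64, 64, 25])

In an actual model, several such blocks would be stacked and followed by global pooling and a classifier; the paper's complementary relationship graph convolution and joint cross-connection module would replace the simple fixed-adjacency aggregation shown here.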
ISSN: 1380-7501 (print); 1573-7721 (online)
DOI: 10.1007/s11042-024-18864-y