RE-STNet: relational enhancement spatio-temporal networks based on skeleton action recognition


Bibliographic Details
Published in: Multimedia Tools and Applications, Vol. 84, No. 8, pp. 4049-4069
Main Authors: Chen, Hongwei; He, Shiqi; Chen, Zexi
Format: Journal Article
Language: English
Published: New York: Springer US (Springer Nature B.V.), 01.03.2025

Summary: Learning comprehensive spatio-temporal joint connections in complex actions is crucial for recognizing actions from skeleton sequences. However, existing methods extract spatio-temporal features uniformly without focusing on critical joint connections, and they fail to provide effective complementary information for the acquired joint features. Additionally, using a single-level topology restricts the exploration of global node relationships, with potential loss of implicit node correlations that can affect model fusion. To address these challenges, this study introduces the Relational Enhancement Spatio-Temporal Networks (RE-STNet). RE-STNet employs a complementary relationship graph convolution method to capture crucial joint connections and the corresponding positional information within a region, while a joint cross-connection module captures the global receptive field of the current pose. Furthermore, because action sequences contain considerable uninformative content, the paper proposes a temporal incentive module that captures salient frame information and combines it with a multi-scale temporal convolution module to enrich the temporal features. The resulting RE-STNet architecture is evaluated on three skeleton datasets, achieving accuracies of 92.2% on the NTU RGB+D 60 cross-subject split, 88.6% on the NTU RGB+D 120 cross-subject split, and 95.5% on NW-UCLA. The experimental results demonstrate that the model learns more comprehensive spatio-temporal joint information.
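
Illustrative sketch: the abstract does not include implementation details, so the following is a minimal, hypothetical PyTorch-style spatio-temporal block, not the authors' code. It only illustrates the general ingredients the summary names (a spatial graph convolution over joints, a frame-level gate in the spirit of a temporal incentive module, and multi-scale temporal convolutions); all class names, shapes, and hyperparameters are assumptions.

# Minimal sketch (assumed design, not the published RE-STNet implementation).
import torch
import torch.nn as nn

class SpatialGraphConv(nn.Module):
    """Graph convolution over the joint dimension using a fixed adjacency A (V x V)."""
    def __init__(self, in_channels, out_channels, A):
        super().__init__()
        self.register_buffer("A", A)               # normalized skeleton adjacency (assumed)
        self.proj = nn.Conv2d(in_channels, out_channels, kernel_size=1)
    def forward(self, x):                          # x: (N, C, T, V)
        x = torch.einsum("nctv,vw->nctw", x, self.A)   # aggregate neighboring joints
        return self.proj(x)

class TemporalGate(nn.Module):
    """Frame-wise gate that down-weights uninformative frames (incentive-style, assumed)."""
    def __init__(self, channels):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(channels, channels // 4), nn.ReLU(),
                                nn.Linear(channels // 4, 1), nn.Sigmoid())
    def forward(self, x):                          # x: (N, C, T, V)
        s = x.mean(dim=3).permute(0, 2, 1)         # (N, T, C): per-frame descriptor
        w = self.fc(s).permute(0, 2, 1).unsqueeze(-1)  # (N, 1, T, 1) frame weights
        return x * w

class MultiScaleTemporalConv(nn.Module):
    """Parallel temporal convolutions with different dilations, summed."""
    def __init__(self, channels, dilations=(1, 2)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=(5, 1),
                      padding=(2 * d, 0), dilation=(d, 1))
            for d in dilations])
    def forward(self, x):                          # x: (N, C, T, V)
        return sum(b(x) for b in self.branches)

class STBlock(nn.Module):
    def __init__(self, in_channels, out_channels, A):
        super().__init__()
        self.spatial = SpatialGraphConv(in_channels, out_channels, A)
        self.gate = TemporalGate(out_channels)
        self.temporal = MultiScaleTemporalConv(out_channels)
        self.relu = nn.ReLU()
    def forward(self, x):
        return self.relu(self.temporal(self.gate(self.spatial(x))))

if __name__ == "__main__":
    V = 25                                         # e.g. NTU RGB+D joint count
    A = torch.eye(V)                               # placeholder adjacency for the sketch
    block = STBlock(3, 64, A)
    x = torch.randn(2, 3, 64, V)                   # (batch, channels, frames, joints)
    print(block(x).shape)                          # torch.Size([2, 64, 64, 25])

In an actual model, several such blocks would be stacked and followed by global pooling and a classifier; the paper's complementary relationship graph convolution and joint cross-connection module would replace the simple fixed-adjacency aggregation shown here.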
ISSN: 1380-7501 (print); 1573-7721 (online)
DOI: 10.1007/s11042-024-18864-y