RE-STNet: relational enhancement spatio-temporal networks based on skeleton action recognition
| Published in | Multimedia Tools and Applications, Vol. 84, No. 8, pp. 4049-4069 |
|---|---|
| Format | Journal Article |
| Language | English |
| Published | New York: Springer US, 01.03.2025 (Springer Nature B.V.) |
Summary: Learning comprehensive spatio-temporal joint connections in complex actions is crucial for recognizing actions in skeleton sequences. However, existing methods extract spatio-temporal features uniformly, without focusing on critical joint connections, and fail to provide effective complementary information for the acquired joint features. Additionally, using a single-level topology restricts the exploration of global node relationships, so implicit node correlations that affect model fusion can be lost. To address these challenges, this study introduces the Relational Enhancement Spatio-Temporal Networks (RE-STNet). RE-STNet employs a complementary-relationship graph convolution to capture crucial joint connections and the corresponding positional information within a region, and a joint cross-connection module to capture the global receptive field of the current pose. Furthermore, because action sequences contain a great deal of uninformative content, this paper proposes a temporal incentive module to capture salient temporal-frame information, combined with a multi-scale temporal convolution module to enrich the temporal features. The resulting RE-STNet architecture is evaluated on three skeleton datasets, achieving 92.2% accuracy on the NTU RGB+D 60 cross-subject split, 88.6% on the NTU RGB+D 120 cross-subject split, and 95.5% on NW-UCLA. The experimental results demonstrate that the model learns more comprehensive spatio-temporal joint information.
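The summary names building blocks that are common in skeleton-based recognition: a graph convolution over the joint adjacency, a gate that re-weights salient frames (the "temporal incentive" idea), and a multi-scale temporal convolution. For orientation only, below is a minimal PyTorch sketch of such a spatio-temporal block, assuming a standard ST-GCN-style formulation; every module name, shape, and hyperparameter here is a hypothetical illustration, not the authors' RE-STNet implementation.

```python
# Illustrative sketch of an ST-GCN-style block (assumptions, not the paper's code):
# spatial graph conv over the skeleton adjacency, a frame-attention gate standing
# in for the "temporal incentive" idea, and a multi-scale temporal convolution.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialGraphConv(nn.Module):
    """Graph convolution over joints: project channels, then mix joints via A."""

    def __init__(self, in_ch, out_ch, adjacency):
        super().__init__()
        self.register_buffer("A", adjacency)  # normalized (V x V) skeleton graph
        self.proj = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        # x: (N, C, T, V); aggregate each joint's neighbors through A.
        x = self.proj(x)
        return torch.einsum("nctv,vw->nctw", x, self.A)


class TemporalIncentive(nn.Module):
    """Frame-wise gate: scores each frame so salient frames dominate."""

    def __init__(self, channels):
        super().__init__()
        self.fc = nn.Conv1d(channels, 1, kernel_size=1)

    def forward(self, x):
        # x: (N, C, T, V); pool joints, score frames, gate the sequence.
        frame_feat = x.mean(dim=-1)                # (N, C, T)
        gate = torch.sigmoid(self.fc(frame_feat))  # (N, 1, T)
        return x * gate.unsqueeze(-1)


class MultiScaleTemporalConv(nn.Module):
    """Parallel temporal convolutions with different dilations, summed."""

    def __init__(self, channels, dilations=(1, 2)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=(5, 1),
                      padding=(2 * d, 0), dilation=(d, 1))
            for d in dilations
        )

    def forward(self, x):
        # Each branch preserves (N, C, T, V); sum fuses the temporal scales.
        return sum(branch(x) for branch in self.branches)


class STBlock(nn.Module):
    def __init__(self, in_ch, out_ch, adjacency):
        super().__init__()
        self.gcn = SpatialGraphConv(in_ch, out_ch, adjacency)
        self.incentive = TemporalIncentive(out_ch)
        self.tcn = MultiScaleTemporalConv(out_ch)

    def forward(self, x):
        return F.relu(self.tcn(self.incentive(self.gcn(x))))


# Usage on a dummy skeleton sequence: batch 2, 3 coordinates, 64 frames, 25 joints.
V = 25
A = torch.eye(V)  # placeholder; a real model uses the normalized bone graph
block = STBlock(3, 64, A)
out = block(torch.randn(2, 3, 64, V))
print(out.shape)  # torch.Size([2, 64, 64, 25])
```

Note the division of labor in this sketch: the graph convolution mixes information across joints within each frame, the temporal branches mix information across frames at each joint, and the frame gate sits between them so that uninformative frames are suppressed before temporal aggregation.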
ISSN: 1380-7501 (print), 1573-7721 (electronic)
DOI: 10.1007/s11042-024-18864-y