Two-Stream Proximity Graph Transformer for Skeletal Person-Person Interaction Recognition With Statistical Information

Bibliographic Details
Published in: IEEE Access, Vol. 12, pp. 193091-193100
Main Authors: Li, Meng; Wu, Yaqi; Sun, Qiumei; Yang, Weifeng
Format: Journal Article
Language: English
Published: Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2024

Summary: Recognizing person-person interactions is of practical significance, with applications in many fields such as video understanding and video surveillance. Compared with RGB data, skeletal data can depict articulated human movements more accurately because it records joint locations in detail. With the recent success of the Transformer in computer vision, numerous scholars have begun to apply Transformers to person-person interaction recognition. However, these Transformer-based models do not fully account for the dynamic spatiotemporal relationship between interacting people, which remains a challenge. To address this challenge, we propose a novel Transformer-based model, the Two-Stream Proximity Graph Transformer (2s-PGT), for skeletal person-person interaction recognition. Specifically, we first design three types of proximity graphs based on skeletal data to encode the dynamic proximity relationship between interacting people: frame-based, sample-based, and type-based proximity graphs. Second, we embed the proximity graphs into our Transformer-based model to jointly learn the relationship between interacting people from spatiotemporal and semantic perspectives. Third, we investigate a two-stream framework that integrates the information of interactive joints and interactive bones to improve recognition accuracy. Experimental results on three public datasets, SBU (99.07%), NTU-RGB+D (Cross-Subject 95.72%, Cross-View 97.87%), and NTU-RGB+D120 (Cross-Subject 92.01%, Cross-View 91.65%), demonstrate that our approach outperforms state-of-the-art methods.
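As a rough illustration of the method described in the summary, the following Python sketch shows one plausible way to build the three proximity graphs and the bone stream from two skeleton sequences. The Gaussian distance kernel, the frame/sample/class averaging scheme, and the parent-based bone definition are assumptions made for illustration, not the paper's exact formulation.

import numpy as np

def frame_proximity_graph(person_a, person_b, sigma=1.0):
    # Frame-based proximity graph: one adjacency matrix per frame.
    # person_a, person_b: (T, J, 3) arrays of T frames, J joints, 3-D coords.
    # Returns (T, J, J); entry [t, i, j] weights the proximity of joint i
    # of person A to joint j of person B at frame t.
    diff = person_a[:, :, None, :] - person_b[:, None, :, :]  # (T, J, J, 3)
    dist = np.linalg.norm(diff, axis=-1)                      # (T, J, J)
    # Closer joint pairs get larger edge weights (assumed Gaussian kernel).
    return np.exp(-dist**2 / (2.0 * sigma**2))

def sample_proximity_graph(person_a, person_b, sigma=1.0):
    # Sample-based variant: average the frame graphs over one sample.
    return frame_proximity_graph(person_a, person_b, sigma).mean(axis=0)

def type_proximity_graph(samples, sigma=1.0):
    # Type-based variant: average sample graphs over all samples of one
    # interaction class. samples is a list of (person_a, person_b) pairs.
    return np.mean([sample_proximity_graph(a, b, sigma) for a, b in samples], axis=0)

def bone_stream(joints, parents):
    # Bone features for the second stream, assumed here as vectors from each
    # joint to its kinematic parent (the root's parent is itself, giving a
    # zero bone vector). joints: (T, J, 3); parents: length-J index array.
    return joints - joints[:, parents, :]

Under these assumptions, the joint stream would consume the raw coordinates and the bone stream these difference vectors; the two-stream fusion mentioned in the summary could then be as simple as averaging the per-class scores of two identically structured PGT networks.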
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2024.3516511