Two-Stream Proximity Graph Transformer for Skeletal Person-Person Interaction Recognition With Statistical Information

Bibliographic Details
Published in: IEEE Access, Vol. 12, pp. 193091-193100
Main Authors: Li, Meng; Wu, Yaqi; Sun, Qiumei; Yang, Weifeng
Format: Journal Article
Language: English
Published: Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2024

Summary: Recognizing person-person interactions is of practical significance, with applications in many fields such as video understanding and video surveillance. Compared with RGB data, skeletal data can depict articulated human movements more accurately because it records joint locations in detail. With the recent success of the Transformer in computer vision, numerous scholars have begun to apply Transformers to person-person interaction recognition. However, these Transformer-based models do not fully account for the dynamic spatiotemporal relationship between interacting people, which remains a challenge. To address this challenge, we propose a novel Transformer-based model, the Two-Stream Proximity Graph Transformer (2s-PGT), for skeletal person-person interaction recognition. Specifically, we first design three types of proximity graphs based on skeletal data to encode the dynamic proximity relationship between interacting people: frame-based, sample-based, and type-based proximity graphs. Second, we embed the proximity graphs into our Transformer-based model to jointly learn the relationship between interacting people from spatiotemporal and semantic perspectives. Third, we investigate a two-stream framework that integrates the information of interactive joints and interactive bones to improve recognition accuracy. Experimental results on three public datasets, SBU (99.07%), NTU-RGB+D (Cross-Subject 95.72%, Cross-View 97.87%), and NTU-RGB+D120 (Cross-Subject 92.01%, Cross-View 91.65%), demonstrate that our approach outperforms state-of-the-art methods.
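As a rough illustration of the method described in the summary, the following Python sketch shows one plausible way to build the three proximity graphs and the bone stream from two skeleton sequences. The Gaussian distance kernel, the frame/sample/class averaging scheme, and the parent-based bone definition are assumptions made for illustration, not the paper's exact formulation.

import numpy as np

def frame_proximity_graph(person_a, person_b, sigma=1.0):
    # Frame-based proximity graph: one adjacency matrix per frame.
    # person_a, person_b: (T, J, 3) arrays of T frames, J joints, 3-D coords.
    # Returns (T, J, J); entry [t, i, j] weights the proximity of joint i
    # of person A to joint j of person B at frame t.
    diff = person_a[:, :, None, :] - person_b[:, None, :, :]  # (T, J, J, 3)
    dist = np.linalg.norm(diff, axis=-1)                      # (T, J, J)
    # Closer joint pairs get larger edge weights (assumed Gaussian kernel).
    return np.exp(-dist**2 / (2.0 * sigma**2))

def sample_proximity_graph(person_a, person_b, sigma=1.0):
    # Sample-based variant: average the frame graphs over one sample.
    return frame_proximity_graph(person_a, person_b, sigma).mean(axis=0)

def type_proximity_graph(samples, sigma=1.0):
    # Type-based variant: average sample graphs over all samples of one
    # interaction class. samples is a list of (person_a, person_b) pairs.
    return np.mean([sample_proximity_graph(a, b, sigma) for a, b in samples], axis=0)

def bone_stream(joints, parents):
    # Bone features for the second stream, assumed here as vectors from each
    # joint to its kinematic parent (the root's parent is itself, giving a
    # zero bone vector). joints: (T, J, 3); parents: length-J index array.
    return joints - joints[:, parents, :]

Under these assumptions, the joint stream would consume the raw coordinates and the bone stream these difference vectors; the two-stream fusion mentioned in the summary could then be as simple as averaging the per-class scores of two identically structured PGT networks.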
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2024.3516511