SSpose: Self-supervised Spatial-aware Model for Human Pose Estimation

Bibliographic Details
Published in: IEEE Transactions on Artificial Intelligence, Vol. 5, No. 11, pp. 1-14
Main Authors: Yu, Linfang; Qin, Zhen; Xu, Liqun; Qin, Zhiguang; Choo, Kim-Kwang Raymond
Format: Journal Article
Language: English
Published: IEEE, 01.11.2024
Summary: Human pose estimation heavily relies on the anatomical relationships among different body parts to locate keypoints. Despite the significant progress achieved by CNN-based models in human pose estimation, they typically fail to explicitly learn the global dependencies among various body parts. To overcome this limitation, we propose a spatial-aware human pose estimation model called SSpose that explicitly captures the spatial dependencies between specific keypoints and different locations in an image. The proposed SSpose model adopts a hybrid CNN-Transformer encoder to simultaneously capture local features and global dependencies. To better preserve image details, a multi-scale fusion module is introduced to integrate coarse- and fine-grained image information. By establishing a connection with the activation maximization (AM) principle, the final attention layer of the Transformer aggregates contributions (i.e., attention scores) from all image positions to form the maximum position in the heatmap, thereby achieving keypoint localization in the head structure. Additionally, to address the issue of visible-information leakage in convolutional reconstruction, we devise a self-supervised training framework for the SSpose model. This framework incorporates masked autoencoder (MAE) technology into SSpose by utilizing masked convolution and a hierarchical masking strategy, thereby facilitating efficient self-supervised learning. Extensive experiments demonstrate that SSpose performs exceptionally well on the pose estimation task. On the COCO val set, it achieves an AP and AR of 77.3% and 82.1%, respectively, while on the COCO test-dev set, the AP and AR are 76.4% and 81.5%. Moreover, the model exhibits strong generalization capabilities on MPII. Code is available at https://github.com/yulinfangylf/SSpose.
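
To make the architectural idea in the summary concrete, below is a minimal PyTorch sketch of a hybrid CNN-Transformer pose encoder: a convolutional stem extracts local features, a Transformer encoder models global dependencies among all spatial positions, and each token is projected to per-keypoint scores that are reshaped into heatmaps. All module names, layer counts, and sizes here are illustrative assumptions, not the authors' implementation (see the linked repository for the official code), and the multi-scale fusion and MAE-style pretraining stages are omitted.

import torch
import torch.nn as nn

class HybridPoseEncoder(nn.Module):
    """Illustrative hybrid CNN-Transformer encoder with a heatmap head."""

    def __init__(self, num_keypoints=17, embed_dim=256, depth=4, heads=8):
        super().__init__()
        # CNN stem: captures local texture and downsamples the input by 4x.
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, embed_dim, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Transformer encoder: self-attention over all spatial tokens gives
        # every position access to global body-part dependencies.
        layer = nn.TransformerEncoderLayer(
            embed_dim, heads, dim_feedforward=embed_dim * 4, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # Head: per-token projection to K keypoint scores, reshaped to heatmaps.
        self.head = nn.Linear(embed_dim, num_keypoints)

    def forward(self, x):
        f = self.stem(x)                           # (B, C, H/4, W/4)
        b, c, h, w = f.shape
        tokens = f.flatten(2).transpose(1, 2)      # (B, H*W, C) spatial tokens
        tokens = self.encoder(tokens)              # global self-attention
        heat = self.head(tokens)                   # (B, H*W, K)
        return heat.transpose(1, 2).reshape(b, -1, h, w)  # (B, K, H/4, W/4)

# Usage: each keypoint is localized at the argmax of its heatmap channel,
# mirroring the summary's description of attention scores aggregating into
# a maximum position in the heatmap.
model = HybridPoseEncoder()
heatmaps = model(torch.randn(1, 3, 256, 192))      # -> (1, 17, 64, 48)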
ISSN: 2691-4581
DOI: 10.1109/TAI.2024.3440220