Exploiting Static and Dynamic Human Joint Relations for 3D Pose Estimation via Cascade Transformers


Bibliographic Details
Published in: 2022 26th International Conference on Pattern Recognition (ICPR), pp. 4522 - 4528
Main Authors: Song, Bo; Ji, Changjiang; Fan, Shuo
Format: Conference Proceeding
Language: English
Published: IEEE, 21.08.2022

Summary: The transformer has become the dominant model in natural language processing (NLP). Researchers have recently applied transformer architectures to various computer vision tasks and achieved competitive results. However, little work has been done on transformer architectures for 3D human pose estimation (HPE). In this work, we propose cascade transformers, a novel transformer-based method for 3D HPE from a single image. Specifically, our cascade transformers consist of two transformer encoders that exploit static and dynamic human joint relations, respectively. Leveraging the self-attention module and the cascade structure, our method comprehensively models the static and dynamic human joint relations. We evaluate our method on Human3.6M. Extensive experiments show that our method achieves excellent performance without explicitly using human skeleton priors. Notably, our single-image method achieves approximately the same performance as PoseFormer, the current best transformer-based method, even when PoseFormer uses 9 frames to predict pose.
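
The abstract describes the architecture only at a high level. The following is a minimal, hypothetical PyTorch sketch of the general idea of cascading two transformer encoders over per-joint tokens to lift 2D keypoints from a single image to 3D. The joint count, embedding size, head and layer counts, and all class and variable names are illustrative assumptions, not the authors' implementation; in particular, how the paper realizes the "static" versus "dynamic" relation stages is not specified in the abstract.

    import torch
    import torch.nn as nn

    # Hypothetical sketch only: a cascade of two transformer encoders mapping
    # 2D joint detections from a single image to 3D joint positions. All sizes
    # and names are illustrative assumptions, not taken from the paper.
    class CascadeTransformerPose(nn.Module):
        def __init__(self, num_joints=17, embed_dim=64, num_heads=4, num_layers=2):
            super().__init__()
            # Each detected 2D joint (x, y) becomes one token so self-attention
            # can model pairwise joint relations; a learned positional embedding
            # preserves joint identity.
            self.joint_embed = nn.Linear(2, embed_dim)
            self.pos_embed = nn.Parameter(torch.zeros(1, num_joints, embed_dim))

            def make_encoder():
                layer = nn.TransformerEncoderLayer(
                    d_model=embed_dim, nhead=num_heads,
                    dim_feedforward=embed_dim * 2, batch_first=True)
                return nn.TransformerEncoder(layer, num_layers=num_layers)

            # Two encoders in cascade: the second refines the first stage's output.
            self.encoder_stage1 = make_encoder()
            self.encoder_stage2 = make_encoder()
            # Regress a 3D position from each joint token.
            self.head = nn.Linear(embed_dim, 3)

        def forward(self, joints_2d):
            # joints_2d: (batch, num_joints, 2) keypoints detected in one image
            x = self.joint_embed(joints_2d) + self.pos_embed
            x = self.encoder_stage1(x)   # first encoder stage
            x = self.encoder_stage2(x)   # cascaded second encoder stage
            return self.head(x)          # (batch, num_joints, 3)

    # Usage: lift a batch of eight 2D keypoint sets to 3D.
    model = CascadeTransformerPose()
    pred_3d = model(torch.randn(8, 17, 2))  # shape (8, 17, 3)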
ISSN: 2831-7475
DOI: 10.1109/ICPR56361.2022.9956421