Exploiting Static and Dynamic Human Joint Relations for 3D Pose Estimation via Cascade Transformers


Bibliographic Details
Published in: 2022 26th International Conference on Pattern Recognition (ICPR), pp. 4522 - 4528
Main Authors: Song, Bo; Ji, Changjiang; Fan, Shuo
Format: Conference Proceeding
Language: English
Published: IEEE, 21.08.2022

Summary: The transformer has become the dominant model in natural language processing (NLP). Researchers have recently applied transformer architectures to various computer vision tasks and achieved competitive results. However, little work has been done on transformer architectures for 3D human pose estimation (HPE). In this work, we propose cascade transformers, a novel transformer-based method for 3D HPE from a single image. Specifically, our cascade transformers consist of two transformer encoders that exploit static and dynamic human joint relations, respectively. Leveraging the self-attention module and the cascade structure, our method comprehensively models the static and dynamic human joint relations. We evaluate our method on Human3.6M. Extensive experiments show that our method achieves excellent performance without explicitly using human skeleton priors. Notably, our single-image method achieves approximately the same performance as PoseFormer, the current best transformer-based method, even when PoseFormer uses 9 frames to predict pose.
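
The abstract describes the architecture only at a high level. The following is a minimal, hypothetical PyTorch sketch of the general idea of cascading two transformer encoders over per-joint tokens to lift 2D keypoints from a single image to 3D. The joint count, embedding size, head and layer counts, and all class and variable names are illustrative assumptions, not the authors' implementation; in particular, how the paper realizes the "static" versus "dynamic" relation stages is not specified in the abstract.

    import torch
    import torch.nn as nn

    # Hypothetical sketch only: a cascade of two transformer encoders mapping
    # 2D joint detections from a single image to 3D joint positions. All sizes
    # and names are illustrative assumptions, not taken from the paper.
    class CascadeTransformerPose(nn.Module):
        def __init__(self, num_joints=17, embed_dim=64, num_heads=4, num_layers=2):
            super().__init__()
            # Each detected 2D joint (x, y) becomes one token so self-attention
            # can model pairwise joint relations; a learned positional embedding
            # preserves joint identity.
            self.joint_embed = nn.Linear(2, embed_dim)
            self.pos_embed = nn.Parameter(torch.zeros(1, num_joints, embed_dim))

            def make_encoder():
                layer = nn.TransformerEncoderLayer(
                    d_model=embed_dim, nhead=num_heads,
                    dim_feedforward=embed_dim * 2, batch_first=True)
                return nn.TransformerEncoder(layer, num_layers=num_layers)

            # Two encoders in cascade: the second refines the first stage's output.
            self.encoder_stage1 = make_encoder()
            self.encoder_stage2 = make_encoder()
            # Regress a 3D position from each joint token.
            self.head = nn.Linear(embed_dim, 3)

        def forward(self, joints_2d):
            # joints_2d: (batch, num_joints, 2) keypoints detected in one image
            x = self.joint_embed(joints_2d) + self.pos_embed
            x = self.encoder_stage1(x)   # first encoder stage
            x = self.encoder_stage2(x)   # cascaded second encoder stage
            return self.head(x)          # (batch, num_joints, 3)

    # Usage: lift a batch of eight 2D keypoint sets to 3D.
    model = CascadeTransformerPose()
    pred_3d = model(torch.randn(8, 17, 2))  # shape (8, 17, 3)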
ISSN: 2831-7475
DOI: 10.1109/ICPR56361.2022.9956421