End-to-End Multi-Person Pose Estimation with Transformers

Bibliographic Details
Published in: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11059-11068
Main Authors: Shi, Dahu; Wei, Xing; Li, Liangqi; Ren, Ye; Tan, Wenming
Format: Conference Proceeding
Language: English
Published: IEEE, 01.06.2022
Summary: Current methods of multi-person pose estimation typically treat the localization and association of body joints separately. In this paper, we propose the first fully end-to-end multi-person Pose Estimation framework with TRansformers, termed PETR. Our method views pose estimation as a hierarchical set prediction problem and effectively removes the need for many hand-crafted modules such as RoI cropping, NMS, and grouping post-processing. In PETR, multiple pose queries are learned to directly reason about a set of full-body poses. A joint decoder is then used to further refine the poses by exploring the kinematic relations between body joints. With the attention mechanism, the proposed method is able to adaptively attend to the features most relevant to the target keypoints, which largely overcomes the feature misalignment difficulty in pose estimation and improves performance considerably. Extensive experiments on the MS COCO and CrowdPose benchmarks show that PETR performs favorably against state-of-the-art approaches in terms of both accuracy and efficiency. The code and models are available at https://github.com/hikvision-research/opera.
ISSN: 2575-7075
DOI: 10.1109/CVPR52688.2022.01079
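
Illustrative note on the summary: the phrase "set prediction" usually means that a fixed number of queries each emit one candidate, and that candidates are matched one-to-one with ground-truth instances, so no NMS is needed to suppress duplicates. The sketch below is a toy version of that matching step for whole poses. It is an assumption for illustration, not the PETR implementation: the summary does not state the matching cost or algorithm, and the names `pose_cost` and `match_poses` are hypothetical. Exhaustive search stands in for a proper assignment solver and is only practical for tiny inputs.

```python
from itertools import permutations
from math import hypot
from typing import Dict, List, Tuple

Pose = List[Tuple[float, float]]  # one (x, y) pair per keypoint


def pose_cost(pred: Pose, target: Pose) -> float:
    """Mean Euclidean distance between corresponding keypoints (assumed cost)."""
    total = sum(hypot(px - tx, py - ty) for (px, py), (tx, ty) in zip(pred, target))
    return total / len(target)


def match_poses(preds: List[Pose], targets: List[Pose]) -> Dict[int, int]:
    """Return {target_index: prediction_index} minimizing the total cost.

    Each target is assigned a distinct prediction; predictions left over
    would be treated as "no person". Brute force, so toy sizes only.
    """
    if len(targets) > len(preds):
        raise ValueError("need at least as many predictions as targets")
    best_cost = float("inf")
    best_perm: Tuple[int, ...] = ()
    # Try every ordered choice of len(targets) distinct predictions.
    for perm in permutations(range(len(preds)), len(targets)):
        cost = sum(pose_cost(preds[p], targets[t]) for t, p in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return {t: p for t, p in enumerate(best_perm)}


# Three pose queries, two ground-truth people, two keypoints per pose.
preds = [
    [(0.0, 0.0), (0.0, 1.0)],
    [(10.0, 10.0), (10.0, 11.0)],
    [(5.0, 5.0), (5.0, 6.0)],
]
targets = [
    [(10.0, 10.0), (10.0, 12.0)],
    [(0.0, 0.0), (1.0, 1.0)],
]
matching = match_poses(preds, targets)
print(matching)  # {0: 1, 1: 0}; prediction 2 stays unmatched
```

In a real system the exhaustive loop would normally be replaced by a polynomial-time assignment solver, and the cost would likely combine a classification term with a keypoint-similarity term; both choices are outside what the summary specifies.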