MTPose: Human Pose Estimation with High-Resolution Multi-scale Transformers

Bibliographic Details
Published in: Neural Processing Letters, Vol. 54, No. 5, pp. 3941-3964
Main Authors: Wang, Rui; Geng, Fudi; Wang, Xiangyang
Format: Journal Article
Language: English
Published: New York: Springer US, 01.10.2022 (Springer Nature B.V.)
ISSN: 1370-4621
EISSN: 1573-773X
DOI: 10.1007/s11063-022-10794-w

Summary: HRNet (High-Resolution Networks), as reported by Sun et al. (in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019), has been the state-of-the-art human pose estimation method, benefiting from its parallel high-resolution network structure. However, HRNet is still a typical CNN (Convolutional Neural Network) architecture built on local convolution operations. Recently, Transformers have been successfully applied in many computer vision areas. The main mechanism in Transformers is self-attention, which can learn global or long-range dependencies among different parts. In this paper, we propose a human pose estimation framework built upon High-Resolution Multi-scale Transformers, termed MTPose. We combine the advantages of high resolution and Transformers to improve performance. Specifically, we design a sub-network, MTNet (Multi-scale Transformers-based high-resolution Networks), which consists of two parallel branches. One is a high-resolution branch with convolutional local operations, called the local branch. The other is the global branch, which uses multi-scale Transformer encoders to learn long-range dependencies among the whole-body keypoints. At the end of the network, the two branches are integrated to predict the final keypoint heatmaps. Experiments on two benchmark datasets, the MSCOCO keypoint detection dataset and the MPII human pose dataset, demonstrate that our method significantly improves upon state-of-the-art human pose estimation methods. Code will be available at: https://github.com/fudiGeng/MTPose.
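The abstract outlines a two-branch design: a high-resolution convolutional (local) branch and a multi-scale Transformer (global) branch whose outputs are fused to predict one heatmap per keypoint. The PyTorch sketch below only illustrates that general idea; it is not the authors' MTNet, and the module names, channel sizes, number of scales, and additive fusion are assumptions made for illustration (the actual implementation is at the GitHub link above).

# Minimal two-branch sketch of the idea described in the abstract (not the authors' MTNet).
# All names, sizes, and the additive fusion are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalBranch(nn.Module):
    """High-resolution convolutional branch: local operations at full feature resolution."""
    def __init__(self, channels: int = 32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.body(x)


class GlobalBranch(nn.Module):
    """Multi-scale Transformer-encoder branch: self-attention over token sequences
    built from the feature map at several resolutions (scales are an assumption)."""
    def __init__(self, channels: int = 32, scales=(1, 2), num_layers: int = 2, num_heads: int = 4):
        super().__init__()
        self.scales = scales
        self.encoders = nn.ModuleList([
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model=channels, nhead=num_heads,
                                           dim_feedforward=2 * channels, batch_first=True),
                num_layers=num_layers,
            )
            for _ in scales
        ])

    def forward(self, x):
        b, c, h, w = x.shape
        out = torch.zeros_like(x)
        for scale, encoder in zip(self.scales, self.encoders):
            # Downsample to the current scale, flatten to tokens, apply self-attention.
            feat = F.avg_pool2d(x, scale) if scale > 1 else x
            hs, ws = feat.shape[2], feat.shape[3]
            tokens = feat.flatten(2).transpose(1, 2)          # (B, H*W, C)
            tokens = encoder(tokens)
            feat = tokens.transpose(1, 2).reshape(b, c, hs, ws)
            # Upsample back to the input resolution and accumulate across scales.
            out = out + F.interpolate(feat, size=(h, w), mode="bilinear", align_corners=False)
        return out / len(self.scales)


class TwoBranchPoseHead(nn.Module):
    """Fuses the local and global branches and predicts one heatmap per keypoint."""
    def __init__(self, channels: int = 32, num_keypoints: int = 17):
        super().__init__()
        self.local_branch = LocalBranch(channels)
        self.global_branch = GlobalBranch(channels)
        self.heatmap_head = nn.Conv2d(channels, num_keypoints, kernel_size=1)

    def forward(self, x):
        fused = self.local_branch(x) + self.global_branch(x)  # simple additive fusion (assumption)
        return self.heatmap_head(fused)


if __name__ == "__main__":
    model = TwoBranchPoseHead(channels=32, num_keypoints=17)
    feats = torch.randn(1, 32, 64, 48)   # e.g. backbone features for a 256x192 person crop
    heatmaps = model(feats)
    print(heatmaps.shape)                # torch.Size([1, 17, 64, 48])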