MTPose: Human Pose Estimation with High-Resolution Multi-scale Transformers

Bibliographic Details
Published in: Neural Processing Letters, Vol. 54, No. 5, pp. 3941-3964
Main Authors: Wang, Rui; Geng, Fudi; Wang, Xiangyang
Format: Journal Article
Language: English
Published: New York: Springer US, 01.10.2022 (Springer Nature B.V.)
ISSN: 1370-4621
EISSN: 1573-773X
DOI: 10.1007/s11063-022-10794-w

Summary: HRNet (High-Resolution Networks), as reported by Sun et al. (in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019), has been the state-of-the-art human pose estimation method, benefiting from its parallel high-resolution network structure. However, HRNet is still a typical CNN (Convolutional Neural Network) architecture built on local convolution operations. Recently, Transformers have been successfully applied in many computer vision areas. The main mechanism in Transformers is self-attention, which can learn global or long-range dependencies among different parts. In this paper, we propose a human pose estimation framework built upon High-Resolution Multi-scale Transformers, termed MTPose. We combine the advantages of high resolution and Transformers to improve performance. Specifically, we design a sub-network, MTNet (Multi-scale Transformers-based high-resolution Networks), which consists of two parallel branches. One is a high-resolution branch with convolutional local operations, called the local branch. The other is the global branch, which uses multi-scale Transformer encoders to learn long-range dependencies among the whole-body keypoints. At the end of the network, the two branches are integrated to predict the final keypoint heatmaps. Experiments on two benchmark datasets, the MSCOCO keypoint detection dataset and the MPII human pose dataset, demonstrate that our method significantly improves upon state-of-the-art human pose estimation methods. Code will be available at: https://github.com/fudiGeng/MTPose.
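The abstract outlines a two-branch design: a high-resolution convolutional (local) branch and a multi-scale Transformer (global) branch whose outputs are fused to predict one heatmap per keypoint. The PyTorch sketch below only illustrates that general idea; it is not the authors' MTNet, and the module names, channel sizes, number of scales, and additive fusion are assumptions made for illustration (the actual implementation is at the GitHub link above).

# Minimal two-branch sketch of the idea described in the abstract (not the authors' MTNet).
# All names, sizes, and the additive fusion are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalBranch(nn.Module):
    """High-resolution convolutional branch: local operations at full feature resolution."""
    def __init__(self, channels: int = 32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.body(x)


class GlobalBranch(nn.Module):
    """Multi-scale Transformer-encoder branch: self-attention over token sequences
    built from the feature map at several resolutions (scales are an assumption)."""
    def __init__(self, channels: int = 32, scales=(1, 2), num_layers: int = 2, num_heads: int = 4):
        super().__init__()
        self.scales = scales
        self.encoders = nn.ModuleList([
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model=channels, nhead=num_heads,
                                           dim_feedforward=2 * channels, batch_first=True),
                num_layers=num_layers,
            )
            for _ in scales
        ])

    def forward(self, x):
        b, c, h, w = x.shape
        out = torch.zeros_like(x)
        for scale, encoder in zip(self.scales, self.encoders):
            # Downsample to the current scale, flatten to tokens, apply self-attention.
            feat = F.avg_pool2d(x, scale) if scale > 1 else x
            hs, ws = feat.shape[2], feat.shape[3]
            tokens = feat.flatten(2).transpose(1, 2)          # (B, H*W, C)
            tokens = encoder(tokens)
            feat = tokens.transpose(1, 2).reshape(b, c, hs, ws)
            # Upsample back to the input resolution and accumulate across scales.
            out = out + F.interpolate(feat, size=(h, w), mode="bilinear", align_corners=False)
        return out / len(self.scales)


class TwoBranchPoseHead(nn.Module):
    """Fuses the local and global branches and predicts one heatmap per keypoint."""
    def __init__(self, channels: int = 32, num_keypoints: int = 17):
        super().__init__()
        self.local_branch = LocalBranch(channels)
        self.global_branch = GlobalBranch(channels)
        self.heatmap_head = nn.Conv2d(channels, num_keypoints, kernel_size=1)

    def forward(self, x):
        fused = self.local_branch(x) + self.global_branch(x)  # simple additive fusion (assumption)
        return self.heatmap_head(fused)


if __name__ == "__main__":
    model = TwoBranchPoseHead(channels=32, num_keypoints=17)
    feats = torch.randn(1, 32, 64, 48)   # e.g. backbone features for a 256x192 person crop
    heatmaps = model(feats)
    print(heatmaps.shape)                # torch.Size([1, 17, 64, 48])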