CSIT: Channel Spatial Integrated Transformer for human pose estimation

Bibliographic Details
Published in: IET image processing Vol. 17; no. 10; pp. 3002-3011
Main Authors: Li, Shaohua; Zhang, Haixiang; Ma, Hanjie; Feng, Jie; Jiang, Mingfeng
Format: Journal Article
Language: English
Published: Wiley 01.08.2023
Summary: Human keypoint detection differs from general detection tasks and requires networks that can learn both visual information and anatomical constraints. Because CNNs excel at extracting texture features from images and transformers learn the correlations among keypoints well, many CTPNets (CNN+transformer type human pose estimation networks) have emerged. However, these networks pay little attention to how the features extracted by the CNN are processed and expand them only along the channel dimension, ignoring the spatial features in the visual information that are essential for complex detection tasks such as pose estimation. To address this, the channel spatial integrated transformer for human pose estimation, termed CSIT, is proposed. Visual information is summarized as texture and spatial information, and a parallel network expands the feature maps in the channel and spatial dimensions to learn texture features and spatial features, respectively. In addition, anatomical constraint information is learned through keypoint embeddings. At the end of the network, keypoints are predicted with a 1D vector representation, which offers stronger performance and is more compatible with the transformer's characteristics. Experiments show that CSIT outperforms mainstream CTPNets on the COCO test‐dev dataset and also achieves satisfactory results on the MPII dataset. A new architecture for human pose estimation called CSIT is proposed, which focuses on spatial features in visual information and innovatively uses a parallel network to combine spatial features with texture features, so that information is fully extracted from images through a transformer.
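The summary names two ideas that a small example may make more concrete: viewing one feature map along either the channel or the spatial dimension, and predicting a keypoint from 1D vectors. The following is a minimal sketch of one possible reading, not the authors' implementation. The function names, the toy shapes, the argmax decoding, and the `split_ratio` parameter are all assumptions introduced for illustration.

```python
# Illustrative sketch only: every name and shape here is hypothetical,
# not taken from the CSIT paper or its code.

def channel_tokens(fmap):
    """Flatten a C x H x W feature map into C tokens of length H*W (one per channel)."""
    return [[v for row in ch for v in row] for ch in fmap]

def spatial_tokens(fmap):
    """Flatten a C x H x W feature map into H*W tokens of length C (one per location)."""
    chans = channel_tokens(fmap)
    return [list(loc) for loc in zip(*chans)]

def decode_1d(x_scores, y_scores, split_ratio=1.0):
    """Decode one keypoint from two 1D score vectors via the argmax bin on each axis.

    split_ratio (assumed) is the number of bins per pixel along an axis.
    """
    x_bin = max(range(len(x_scores)), key=x_scores.__getitem__)
    y_bin = max(range(len(y_scores)), key=y_scores.__getitem__)
    return x_bin / split_ratio, y_bin / split_ratio

# Toy feature map with C=2, H=2, W=3.
fmap = [
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [10, 11, 12]],
]
ch = channel_tokens(fmap)   # 2 tokens, each of length 6
sp = spatial_tokens(fmap)   # 6 tokens, each of length 2

# Toy 1D score vectors for a single keypoint (4 bins per axis).
kp = decode_1d([0.1, 0.7, 0.2, 0.0], [0.0, 0.1, 0.1, 0.8], split_ratio=2.0)
```

In this reading, the two token sets would feed the two parallel branches, and the 1D decoding replaces a 2D heatmap with one score vector per axis for each keypoint.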
ISSN: 1751-9659, 1751-9667
DOI: 10.1049/ipr2.12850