ConvPose: A modern pure ConvNet for human pose estimation

Bibliographic Details
Published in Neurocomputing (Amsterdam) Vol. 544; p. 126301
Main Authors Niu, Yue, Wang, Annan, Wang, Xuewu, Wu, Shengxi
Format Journal Article
Language English
Published Elsevier B.V. 01.08.2023

Summary: We build a modern convolutional neural network. Specifically, we design convolutional variants of some of the components of Transformers and incorporate these convolutional components into a convolutional neural network. [Graphical abstract: three figures depicting the architecture of stage 1, transition 1, and a sub-block of the remaining stages.] Our model, termed ConvPose, is competitive with Transformer-based models despite being a pure convolutional neural network.

Highlights:
• A pure convolutional neural network for human pose estimation is proposed.
• Convolutional neural network architectures remain important for computer vision.
• Introducing Transformer designs into a convolutional neural network improves its performance.
• A network with higher accuracy can be constructed without using Transformer architectures.
• A new way of improving ConvNets is provided.

Abstract: Transformer-based networks have almost entirely outperformed those based on convolutional neural networks (ConvNets) and now predominate in the field of pose estimation. To break this deadlock and revive ConvNets, we propose ConvPose, a pure ConvNet that does not rely on conventional improvement strategies such as attention mechanisms or lightweight designs; instead, it modernizes the network structure itself. The modernization includes deepening the stem cell and transition layers, using a separate pointwise convolution layer, adopting a batch normalization (BN) layer after resizing the feature maps, employing large-kernel depthwise separable convolutions, designing re-parameterized-style structures, and constructing two consecutive modules that contain a mixer and an inverted bottleneck. Each of these designs mirrors a corresponding Transformer component: Transformer-specific components are translated into convolutional variants and incorporated into a ConvNet. Such a modern ConvNet not only retains the simplicity of convolutions but also inherits the strengths of Transformers. Experiments show that ConvPose-BL achieves a 76.0 Average Precision (AP) score on the COCO val2017 dataset. ConvPose performs on par with or better than representative Transformer- and ConvNet-based networks, and shows a slight advantage in speed.
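The modernization recipe described in the abstract (a large-kernel depthwise "mixer" followed by an inverted bottleneck of separate pointwise convolutions, with BN after the spatial mixing step) can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch rendering of one such sub-block, assuming a ConvNeXt-style interpretation of the abstract; the class name, kernel size, and expansion ratio are illustrative assumptions, not the authors' released code.

# Hypothetical sketch (not the authors' code) of one ConvPose-style sub-block:
# a large-kernel depthwise convolution acts as the "mixer" (the convolutional
# analogue of self-attention), followed by an inverted bottleneck built from
# two separate pointwise convolutions (the analogue of the Transformer FFN).
import torch
import torch.nn as nn

class ConvPoseStyleBlock(nn.Module):
    def __init__(self, dim: int, kernel_size: int = 7, expansion: int = 4):
        super().__init__()
        # Large-kernel depthwise convolution: per-channel spatial mixing.
        self.mixer = nn.Conv2d(dim, dim, kernel_size,
                               padding=kernel_size // 2, groups=dim)
        # Batch normalization after spatial mixing (BN rather than LayerNorm).
        self.norm = nn.BatchNorm2d(dim)
        # Inverted bottleneck: expand with one pointwise conv, then project back.
        self.pw_expand = nn.Conv2d(dim, expansion * dim, kernel_size=1)
        self.act = nn.GELU()
        self.pw_project = nn.Conv2d(expansion * dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        shortcut = x
        x = self.mixer(x)
        x = self.norm(x)
        x = self.pw_expand(x)
        x = self.act(x)
        x = self.pw_project(x)
        return shortcut + x  # residual connection, as in a Transformer block

# Usage: a 64-channel feature map keeps its spatial shape through the block.
block = ConvPoseStyleBlock(dim=64)
out = block(torch.randn(1, 64, 64, 48))
print(out.shape)  # torch.Size([1, 64, 64, 48])

Following the abstract, two such consecutive modules (mixer plus inverted bottleneck) would form one sub-block of a stage; the re-parameterized-style branches mentioned in the abstract are omitted here for brevity.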
ISSN: 0925-2312, 1872-8286
DOI: 10.1016/j.neucom.2023.126301