ConvPose: A modern pure ConvNet for human pose estimation
| Published in | Neurocomputing (Amsterdam), Vol. 544, p. 126301 |
|---|---|
| Format | Journal Article |
| Language | English |
| Published | Elsevier B.V., 01.08.2023 |
Summary:
We build a modern convolutional neural network. Specifically, we design convolutional variants of several Transformer components and incorporate them into a convolutional neural network. The three figures depict, respectively, the architecture of stage 1, transition 1, and a sub-block of the remaining stages. Our model, termed ConvPose, is competitive with Transformer-based models despite being a pure convolutional neural network.
• A pure convolutional neural network for human pose estimation is proposed.
• Convolutional neural network architectures remain important for computer vision.
• Introducing Transformer designs into a convolutional neural network improves its performance.
• A network with higher accuracy can be constructed without using Transformer architectures.
• A new way of improving ConvNets is provided.
Transformer-based networks have almost completely outperformed those based on convolutional neural networks (ConvNets) and predominate in the field of pose estimation. To break this deadlock and revitalize ConvNets, we propose ConvPose, a pure ConvNet that does not rely on conventional improvement strategies such as attention mechanisms or lightweight designs, but instead modernizes the network structure itself. The modernization includes: deepening the stem cell and transition layers; using a separate pointwise convolution layer; adopting a batch normalization (BN) layer after resizing the feature maps; employing large-kernel depthwise separable convolutions and designing re-parameterized-style structures; and constructing two consecutive modules that contain a mixer and an inverted bottleneck, among other changes. All of these designs mirror the corresponding Transformer architectures: Transformer-specific components are translated into convolutional variants and incorporated into a ConvNet. Such a modernized ConvNet not only retains the simplicity of convolutions but also takes advantage of Transformer designs. Experiments show that ConvPose-BL achieves a 76.0 Average Precision (AP) score on the COCO val2017 dataset. ConvPose performs on par with or better than existing representative Transformer- and ConvNet-based networks, with a slight advantage in speed.
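The abstract's central convolutional ingredient, a large-kernel depthwise separable convolution (a per-channel spatial "mixer" followed by pointwise channel mixing), can be sketched as follows. This is an illustrative NumPy implementation under assumed shapes, not the authors' code; the function and parameter names are hypothetical.

```python
import numpy as np

def depthwise_separable_conv(x, dw_kernels, pw_weights):
    """Large-kernel depthwise separable convolution on a (C, H, W) input.

    dw_kernels: (C, k, k) - one spatial filter per channel (depthwise step).
    pw_weights: (C_out, C) - 1x1 weights mixing channels (pointwise step).
    Uses zero padding so spatial size is preserved (odd k assumed).
    """
    C, H, W = x.shape
    k = dw_kernels.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    dw = np.empty_like(x)
    for c in range(C):  # each channel is filtered independently (spatial mixing)
        for i in range(H):
            for j in range(W):
                dw[c, i, j] = np.sum(xp[c, i:i + k, j:j + k] * dw_kernels[c])
    # pointwise 1x1 convolution mixes information across channels
    return np.tensordot(pw_weights, dw, axes=([1], [0]))
```

Splitting spatial and channel mixing this way keeps the parameter count low even for large kernels (e.g., 7x7), which is what makes the depthwise layer a convolutional analogue of a Transformer's token mixer.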
| ISSN | 0925-2312; 1872-8286 |
| DOI | 10.1016/j.neucom.2023.126301 |