ConvFormer: parameter reduction in transformer models for 3D human pose estimation by leveraging dynamic multi-headed convolutional attention

Bibliographic Details
Published in	The Visual Computer Vol. 40; no. 4; pp. 2555-2569
Main Authors	Diaz-Arias, Alec; Shin, Dmitriy
Format	Journal Article
Language	English
Published	Berlin/Heidelberg: Springer Berlin Heidelberg, 01.04.2024; Springer Nature B.V.
Subjects	Accuracy; Artificial Intelligence; Computer Graphics; Computer Science; Connectivity; Datasets; Hypotheses; Image Processing and Computer Vision; Mathematical models; Original Article; Parameter identification; Pose estimation; Reduction; Sparsity; Three dimensional models; Transformers; Dynamic convolutions; Monocular motion capture; 3D human pose estimation
Summary:	Recently, fully-transformer architectures have replaced the de facto convolutional architecture for the 3D human pose estimation task. In this paper, we propose ConvFormer, a novel convolutional transformer that leverages a new dynamic multi-headed convolutional self-attention mechanism for monocular 3D human pose estimation. We designed a spatial and a temporal convolutional transformer to comprehensively model human joint relations within individual frames and globally across the motion sequence. Moreover, we introduce a novel notion of a temporal joints profile for our temporal ConvFormer, which fuses complete temporal information immediately for a local neighborhood of joint features. We validated our method quantitatively and qualitatively on three common benchmark datasets: Human3.6M (H36M), MPI-INF-3DHP, and HumanEva. Extensive experiments were conducted to identify the optimal hyper-parameter set. These experiments demonstrated a significant parameter reduction relative to prior transformer models while attaining state-of-the-art (SOTA) or near-SOTA results on all three datasets. Additionally, we achieved SOTA for Protocol III on H36M for both ground-truth (GT) and CPN detection inputs. Finally, we obtained SOTA on all three metrics for the MPI-INF-3DHP dataset and for all three subjects on HumanEva under Protocol II.
ISSN:	0178-2789; 1432-2315
DOI:	10.1007/s00371-023-02936-5