Temporal Representation Learning on Monocular Videos for 3D Human Pose Estimation

Bibliographic Details
Published in: IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 45, No. 5, pp. 6415-6427
Main Authors: Honari, Sina; Constantin, Victor; Rhodin, Helge; Salzmann, Mathieu; Fua, Pascal
Format: Journal Article
Language: English
Published: United States: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.05.2023

Summary: In this article, we propose an unsupervised feature extraction method to capture temporal information in monocular videos, where we detect and encode the subject of interest in each frame and leverage contrastive self-supervised (CSS) learning to extract rich latent vectors. Instead of simply treating the latent features of nearby frames as positive pairs and those of temporally distant ones as negative pairs, as in other CSS approaches, we explicitly disentangle each latent vector into a time-variant component and a time-invariant one. We then show that applying a contrastive loss only to the time-variant features, while encouraging a gradual transition on them between nearby and distant frames and also reconstructing the input, extracts rich temporal features that are well-suited for human pose estimation. Our approach reduces error by about 50% compared to standard CSS strategies, outperforms other unsupervised single-view methods, and matches the performance of multi-view techniques. When 2D pose is available, our approach can extract even richer latent features and improve 3D pose estimation accuracy, outperforming other state-of-the-art weakly supervised methods.
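The record stops at the abstract, so the short PyTorch sketch below is only an illustration of the objective the abstract describes, not the authors' implementation: the function name disentangled_css_loss, the encoder/decoder modules, the split size var_dim, and the temperature and reconstruction weight are all assumed, and the "gradual transition" between nearby and distant frames is simplified here to an InfoNCE term that takes adjacent frames as positives and more distant frames in the same clip as negatives.

# Hypothetical sketch of the disentangled contrastive objective described in the
# abstract; encoder, decoder, var_dim, temperature and recon_weight are assumed
# names and hyper-parameters, not the paper's actual implementation.
import torch
import torch.nn.functional as F

def disentangled_css_loss(frames, encoder, decoder, var_dim,
                          temperature=0.1, recon_weight=1.0):
    """frames: (B, T, C, H, W) clip of consecutive crops of the subject."""
    B, T = frames.shape[:2]
    z = encoder(frames.flatten(0, 1)).view(B, T, -1)   # per-frame latents
    z_var = z[..., :var_dim]                            # time-variant part; the time-invariant
                                                        # remainder z[..., var_dim:] enters the
                                                        # objective only through reconstruction

    # InfoNCE-style contrastive loss on the time-variant features only:
    # each frame's positive is its immediate successor, its negatives are the
    # time-variant features of more distant frames within the same clip.
    q = F.normalize(z_var[:, :-1], dim=-1)              # anchors (frames 0..T-2)
    k = F.normalize(z_var[:, 1:], dim=-1)               # positives (frames 1..T-1)
    logits = torch.einsum('btd,bsd->bts', q, k) / temperature
    targets = torch.arange(T - 1, device=frames.device).expand(B, -1)
    contrastive = F.cross_entropy(logits.flatten(0, 1), targets.flatten())

    # Reconstructing the input from the full latent keeps the time-invariant
    # part (appearance, identity) informative instead of letting it collapse.
    recon = decoder(z.flatten(0, 1))
    reconstruction = F.mse_loss(recon, frames.flatten(0, 1))

    return contrastive + recon_weight * reconstruction

In this reading, only the first var_dim channels of each latent carry the temporal signal used for pose, while the remaining channels and the reconstruction term preserve frame content that does not change over time.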
ISSN: 0162-8828
1939-3539
2160-9292
DOI: 10.1109/TPAMI.2022.3215307