VideoMAC: Video Masked Autoencoders Meet ConvNets

Bibliographic Details
Published in: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 22733-22743
Main Authors: Pei, Gensheng; Chen, Tao; Jiang, Xiruo; Liu, Huafeng; Sun, Zeren; Yao, Yazhou
Format: Conference Proceeding
Language: English
Published: IEEE, 16.06.2024

More Information
Summary: Recently, the advancement of self-supervised learning techniques, such as masked autoencoders (MAE), has greatly influenced visual representation learning for images and videos. Nevertheless, the predominant approaches in existing masked image/video modeling rely excessively on resource-intensive vision transformers (ViTs) as the feature encoder. In this paper, we propose a new approach, termed VideoMAC, which combines video masked autoencoders with resource-friendly ConvNets. Specifically, VideoMAC employs symmetric masking on randomly sampled pairs of video frames. To prevent the issue of mask pattern dissipation, we utilize ConvNets implemented with sparse convolutional operators as encoders. Simultaneously, we present a simple yet effective masked video modeling (MVM) approach: a dual-encoder architecture comprising an online encoder and an exponential moving average (EMA) target encoder, aimed at facilitating inter-frame reconstruction consistency in videos. Additionally, we demonstrate that VideoMAC, which empowers classical (ResNet) / modern (ConvNeXt) convolutional encoders to harness the benefits of MVM, outperforms ViT-based approaches on downstream tasks, including video object segmentation (+5.2% / 6.4% J&F), body part propagation (+6.3% / 3.1% mIoU), and human pose tracking (+10.2% / 11.1% PCK@0.1).
ISSN: 2575-7075
DOI: 10.1109/CVPR52733.2024.02145
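
To make the dual-encoder idea in the summary concrete, below is a minimal PyTorch sketch of masked video modeling with an online encoder and an EMA target encoder, assembled only from what the abstract states. It is a hedged illustration, not the authors' implementation: TinyConvEncoder, TinyDecoder, symmetric_mask, the loss terms, the mask ratio, and all hyperparameters are hypothetical stand-ins, and a plain dense ConvNet is used in place of the sparse convolutional operators the paper uses to avoid mask pattern dissipation.

```python
# Hedged sketch of the dual-encoder MVM idea described in the abstract.
# All module names, losses, and hyperparameters are illustrative assumptions.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyConvEncoder(nn.Module):
    """Stand-in ConvNet encoder (the paper uses ResNet/ConvNeXt with sparse convs)."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=4, stride=4),
            nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size=3, padding=1),
            nn.GELU(),
        )

    def forward(self, x):
        return self.net(x)


class TinyDecoder(nn.Module):
    """Lightweight decoder mapping features back to pixel space."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, padding=1),
            nn.GELU(),
            nn.ConvTranspose2d(dim, 3, kernel_size=4, stride=4),
        )

    def forward(self, x):
        return self.net(x)


def symmetric_mask(frame_a, frame_b, patch=16, mask_ratio=0.75):
    """Apply one shared random patch mask to both frames of a sampled pair."""
    B, _, H, W = frame_a.shape
    gh, gw = H // patch, W // patch
    num_patches = gh * gw
    keep = int(num_patches * (1 - mask_ratio))
    ids = torch.rand(B, num_patches, device=frame_a.device).argsort(dim=1)
    mask = torch.ones(B, num_patches, device=frame_a.device)
    mask.scatter_(1, ids[:, :keep], 0.0)      # 0 = visible patch, 1 = masked patch
    mask = mask.view(B, 1, gh, gw)
    mask = F.interpolate(mask, scale_factor=patch, mode="nearest")
    return frame_a * (1 - mask), frame_b * (1 - mask), mask


@torch.no_grad()
def ema_update(online, target, momentum=0.999):
    """Exponential-moving-average update of the target encoder from the online encoder."""
    for p_o, p_t in zip(online.parameters(), target.parameters()):
        p_t.mul_(momentum).add_(p_o, alpha=1.0 - momentum)


# One illustrative training step: reconstruct both frames of a sampled pair and
# encourage the two reconstructions to agree on masked regions (inter-frame consistency).
online_enc, decoder = TinyConvEncoder(), TinyDecoder()
target_enc = copy.deepcopy(online_enc).requires_grad_(False)
opt = torch.optim.AdamW(list(online_enc.parameters()) + list(decoder.parameters()), lr=1e-4)

frame_a, frame_b = torch.rand(2, 3, 224, 224), torch.rand(2, 3, 224, 224)
vis_a, vis_b, mask = symmetric_mask(frame_a, frame_b)

rec_a = decoder(online_enc(vis_a))             # gradient path through the online branch
with torch.no_grad():
    rec_b = decoder(target_enc(vis_b))         # momentum (target) branch, no gradients

loss_rec = ((rec_a - frame_a) ** 2 * mask).mean()   # reconstruction on masked pixels
loss_cons = ((rec_a - rec_b) ** 2 * mask).mean()    # inter-frame reconstruction consistency
(loss_rec + loss_cons).backward()
opt.step(); opt.zero_grad()
ema_update(online_enc, target_enc)
```

In this sketch the EMA update keeps the target encoder a slowly moving copy of the online encoder, so the second frame's reconstruction can serve as a stable consistency target; how the actual VideoMAC weights or formulates these terms is not specified in the abstract.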