Rethinking Image-to-Video Adaptation: An Object-centric Perspective
Format | Journal Article |
---|---|
Language | English |
Published | 09.07.2024 |
DOI | 10.48550/arxiv.2407.06871 |
Summary: | Image-to-video adaptation seeks to efficiently adapt image models for use in the video domain. Instead of finetuning the entire image backbone, many image-to-video adaptation paradigms use lightweight adapters for temporal modeling on top of the spatial module. However, these attempts suffer from limitations in efficiency and interpretability. In this paper, we propose a novel and efficient image-to-video adaptation strategy from an object-centric perspective. Inspired by human perception, which identifies objects as key components of video understanding, we integrate a proxy task of object discovery into image-to-video transfer learning. Specifically, we adopt slot attention with learnable queries to distill each frame into a compact set of object tokens. These object-centric tokens are then processed through object-time interaction layers to model object state changes across time. Combined with two novel object-level losses, we demonstrate the feasibility of performing efficient temporal reasoning solely on the compressed object-centric representations for video downstream tasks. Our method achieves state-of-the-art performance on action recognition benchmarks with fewer tunable parameters: only 5% of fully finetuned models and 50% of efficient tuning methods. In addition, our model performs favorably in zero-shot video object segmentation without further retraining or object annotations, proving the effectiveness of object-centric video understanding. |
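The first stage of the pipeline the summary describes (slot attention with learnable queries distilling each frame into a compact set of object tokens) can be sketched roughly as follows. This is a simplified illustration of generic slot attention, not the paper's implementation: the learned GRU/MLP updates are replaced by a plain weighted mean, and all names and sizes (`slot_attention`, 196 patch tokens, 8 queries, dimension 64) are assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(inputs, slots, n_iters=3, eps=1e-8):
    """Distill N frame tokens into K object slots (simplified sketch).

    inputs: (N, D) per-frame patch tokens from the frozen image backbone.
    slots:  (K, D) learnable query vectors.
    Softmax is taken over the slot axis, so slots compete for patches.
    """
    D = inputs.shape[1]
    for _ in range(n_iters):
        # attention logits between every slot (query) and patch (key)
        logits = slots @ inputs.T / np.sqrt(D)            # (K, N)
        attn = softmax(logits, axis=0)                    # compete over slots
        # normalize per slot, then take a weighted mean of the patches
        attn = attn / (attn.sum(axis=1, keepdims=True) + eps)
        slots = attn @ inputs                             # (K, D) updated slots
    return slots

rng = np.random.default_rng(0)
frame_tokens = rng.normal(size=(196, 64))   # e.g. 14x14 ViT patch tokens
queries = rng.normal(size=(8, 64))          # 8 learnable object queries
object_tokens = slot_attention(frame_tokens, queries)
print(object_tokens.shape)                  # (8, 64)
```

In the full method, the per-frame object tokens produced this way would then be fed to the object-time interaction layers to model state changes across time; only this compressed set of tokens, not the dense patch grid, participates in temporal reasoning.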