Deformable Video Transformer
Main Authors | , |
---|---|
Format | Journal Article |
Language | English |
Published | 31.03.2022 |
Summary: | Video transformers have recently emerged as an effective alternative to convolutional networks for action classification. However, most prior video transformers adopt either global space-time attention or hand-defined strategies to compare patches within and across frames. These fixed attention schemes not only have high computational cost but, by comparing patches at predetermined locations, they neglect the motion dynamics in the video. In this paper, we introduce the Deformable Video Transformer (DVT), which dynamically predicts a small subset of video patches to attend for each query location based on motion information, thus allowing the model to decide where to look in the video based on correspondences across frames. Crucially, these motion-based correspondences are obtained at zero cost from information stored in the compressed format of the video. Our deformable attention mechanism is optimised directly with respect to classification performance, thus eliminating the need for suboptimal hand-design of attention strategies. Experiments on four large-scale video benchmarks (Kinetics-400, Something-Something-V2, EPIC-KITCHENS and Diving-48) demonstrate that, compared to existing video transformers, our model achieves higher accuracy at the same or lower computational cost, and it attains state-of-the-art results on these four datasets. |
---|---|
DOI: | 10.48550/arxiv.2203.16795 |
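
The abstract describes attention in which, for each query patch, a small set of sampling locations is predicted from motion information rather than comparing against all space-time patches. The snippet below is a minimal PyTorch sketch of that general idea, not the authors' implementation: the module name `DeformableSpaceTimeAttention`, the arguments `num_samples` and `motion_dim`, and the assumption that per-patch motion cues (e.g. from compressed-video motion vectors) arrive as a precomputed tensor are all illustrative assumptions.

```python
# Illustrative sketch of motion-guided deformable space-time attention.
# All names and shapes are assumptions for exposition, not the DVT source code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DeformableSpaceTimeAttention(nn.Module):
    def __init__(self, dim, motion_dim=2, num_samples=8):
        super().__init__()
        self.num_samples = num_samples
        self.q_proj = nn.Linear(dim, dim)
        self.kv_proj = nn.Linear(dim, 2 * dim)
        # Sampling offsets (x, y, t per sample) and per-sample weights are
        # predicted from the query feature concatenated with its motion cue.
        self.offset_pred = nn.Linear(dim + motion_dim, 3 * num_samples)
        self.weight_pred = nn.Linear(dim + motion_dim, num_samples)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x, motion, grid_size):
        """x: (B, T*H*W, dim) patch tokens; motion: (B, T*H*W, motion_dim)
        per-patch motion cues; grid_size: (T, H, W) of the token grid."""
        B, N, C = x.shape
        T, H, W = grid_size
        q = self.q_proj(x)
        k, v = self.kv_proj(x).chunk(2, dim=-1)

        # Reshape keys/values to a 3D grid so predicted (x, y, t) locations
        # can be sampled with trilinear interpolation.
        k_grid = k.transpose(1, 2).reshape(B, C, T, H, W)
        v_grid = v.transpose(1, 2).reshape(B, C, T, H, W)

        ctx = torch.cat([q, motion], dim=-1)
        # Offsets in normalized [-1, 1] coordinates, ordered (x, y, t).
        offsets = torch.tanh(self.offset_pred(ctx)).view(B, N, self.num_samples, 3)
        weights = self.weight_pred(ctx).softmax(dim=-1)            # (B, N, S)

        # Base location of every query token in normalized coordinates.
        tt, yy, xx = torch.meshgrid(
            torch.arange(T, device=x.device),
            torch.arange(H, device=x.device),
            torch.arange(W, device=x.device),
            indexing="ij",
        )
        base = torch.stack([
            2 * xx.flatten() / max(W - 1, 1) - 1,
            2 * yy.flatten() / max(H - 1, 1) - 1,
            2 * tt.flatten() / max(T - 1, 1) - 1,
        ], dim=-1)                                                 # (N, 3)
        sample_pts = base[None, :, None, :] + offsets              # (B, N, S, 3)
        sample_pts = sample_pts.view(B, N, self.num_samples, 1, 3)

        # Sample keys/values only at the predicted locations: (B, C, N, S, 1).
        k_s = F.grid_sample(k_grid, sample_pts, align_corners=True)
        v_s = F.grid_sample(v_grid, sample_pts, align_corners=True)
        k_s = k_s.squeeze(-1).permute(0, 2, 3, 1)                  # (B, N, S, C)
        v_s = v_s.squeeze(-1).permute(0, 2, 3, 1)

        # Attend over the small sampled set instead of all T*H*W patches,
        # mixing query-key similarity with the predicted sample weights.
        sim = (q.unsqueeze(2) * k_s).sum(-1) / C ** 0.5            # (B, N, S)
        attn = sim.softmax(dim=-1) * weights
        attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-6)
        out = (attn.unsqueeze(-1) * v_s).sum(dim=2)                # (B, N, C)
        return self.out_proj(out)
```

Under these assumptions the cost per query scales with the number of sampled patches S rather than with the full token count T·H·W, which is the efficiency argument the abstract makes against global space-time attention.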