K-centered Patch Sampling for Efficient Video Recognition

For decades, it has been a common practice to choose a subset of video frames for reducing the computational burden of a video understanding model. In this paper, we argue that this popular heuristic might be sub-optimal under recent transformer-based models. Specifically, inspired by that transform...

Full description

Saved in:
Bibliographic Details
Published inComputer Vision - ECCV 2022 Vol. 13695; pp. 160 - 176
Main Authors Park, Seong Hyeon, Tack, Jihoon, Heo, Byeongho, Ha, Jung-Woo, Shin, Jinwoo
Format Book Chapter
LanguageEnglish
Published Switzerland Springer 2022
Springer Nature Switzerland
SeriesLecture Notes in Computer Science
Subjects
Online AccessGet full text

Cover

Loading…
More Information
Summary:For decades, it has been a common practice to choose a subset of video frames for reducing the computational burden of a video understanding model. In this paper, we argue that this popular heuristic might be sub-optimal under recent transformer-based models. Specifically, inspired by that transformers are built upon patches of video frames, we propose to sample patches rather than frames using the greedy K-center search, i.e., the farthest patch to what has been chosen so far is sampled iteratively. We then show that a transformer trained with the selected video patches can outperform its baseline trained with the video frames sampled in the traditional way. Furthermore, by adding a certain spatiotemporal structuredness condition, the proposed K-centered patch sampling can be even applied to the recent sophisticated video transformers, boosting their performance further. We demonstrate the superiority of our method on Something–Something and Kinetics datasets.
Bibliography:Supplementary InformationThe online version contains supplementary material available at https://doi.org/10.1007/978-3-031-19833-5_10.
ISBN:9783031198328
3031198328
ISSN:0302-9743
1611-3349
DOI:10.1007/978-3-031-19833-5_10