Relational Space-Time Query in Long-Form Videos
| Published in | 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6398-6408 |
|---|---|
| Main Authors | |
| Format | Conference Proceeding |
| Language | English |
| Published | IEEE, 01.06.2023 |
| Subjects | |
Summary: Egocentric videos are often available in the form of uninterrupted, uncurated long videos capturing the camera wearers' daily life activities. Understanding these videos requires models to be able to reason about activities, objects, and their interactions. However, current video benchmarks study these problems independently and under short, curated clips. In contrast, real-world applications, e.g. AR assistants, require bundling these problems for both model development and evaluation. In this paper, we propose to study these problems in a joint framework for long video understanding. Our contributions are three-fold. First, we propose an integrated framework, namely Relational Space-Time Query (ReST), for evaluating video understanding models via templated spatiotemporal queries. Second, we introduce two new benchmarks, ReST-ADL and ReST-Ego4D (footnote: the latest version of our benchmark and models will be available here), which augment the existing egocentric video datasets with abundant query annotations generated by the ReST framework. Finally, we present a set of baselines and in-depth analysis on the two benchmarks and provide insights about the query tasks. We view our integrated framework and benchmarks as a step towards comprehensive, multi-step reasoning in long videos, and believe it will facilitate the development of next generations of video understanding models.
ISSN: 2575-7075
DOI: 10.1109/CVPR52729.2023.00619
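
The abstract describes templated spatiotemporal queries only at a high level; the exact query templates are defined in the paper itself. The following is a minimal, hypothetical sketch of what one such query and a toy grounding routine could look like, assuming a query that asks where an object was during a given activity. The class, field names, template string, and `answer_query` helper are invented for illustration and do not reproduce the actual ReST query format.

```python
from dataclasses import dataclass

# Hypothetical sketch of a templated spatiotemporal query over a long video.
# Field names, the template, and the matching logic are invented for
# illustration only; they do not reproduce the actual ReST query format.

@dataclass
class SpatioTemporalQuery:
    template: str        # e.g. "Where was <object> when <activity> happened?"
    object_name: str     # object slot filled into the template
    activity_name: str   # activity slot filled into the template
    time_window: tuple   # (start_sec, end_sec) restricting the search

    def render(self) -> str:
        """Fill the template slots to produce the natural-language query."""
        return (self.template
                .replace("<object>", self.object_name)
                .replace("<activity>", self.activity_name))


def answer_query(query, object_tracks, activity_segments):
    """Toy grounding routine: return the queried object's box during the
    queried activity, restricted to the query's time window."""
    lo, hi = query.time_window
    for act in activity_segments:
        if act["label"] != query.activity_name:
            continue
        if act["end"] < lo or act["start"] > hi:
            continue
        # Return the detection of the queried object whose timestamp falls
        # inside this activity segment, if any.
        for det in object_tracks.get(query.object_name, []):
            if act["start"] <= det["t"] <= act["end"]:
                return {"t": det["t"], "box": det["box"]}
    return None  # query has no answer in this video


if __name__ == "__main__":
    q = SpatioTemporalQuery(
        template="Where was <object> when <activity> happened?",
        object_name="mug",
        activity_name="wash hands",
        time_window=(0, 3600),
    )
    tracks = {"mug": [{"t": 1204.5, "box": (310, 220, 420, 330)}]}
    segments = [{"label": "wash hands", "start": 1200.0, "end": 1215.0}]
    print(q.render())                         # rendered query text
    print(answer_query(q, tracks, segments))  # grounded answer, if found
```

The point of the sketch is the bundling the abstract argues for: answering even one templated query requires object tracks, activity segments, and their temporal relationship, rather than any of these in isolation.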