DORi: Discovering Object Relationships for Moment Localization of a Natural Language Query in a Video

Bibliographic Details
Published in	2021 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1078-1087
Main Authors	Rodriguez-Opazo, Cristian; Marrese-Taylor, Edison; Fernando, Basura; Li, Hongdong; Gould, Stephen
Format	Conference Proceeding
Language	English
Published	IEEE 01.01.2021
Subjects	Benchmark testing; Computer vision; Conferences; Feature extraction; Location awareness; Natural languages; Technological innovation
Summary:	This paper studies the task of temporal moment localization in long untrimmed videos using natural language queries. Given a query sentence, the goal is to determine the start and end of the relevant segment within the video. Our key innovation is to learn a video feature embedding through a language-conditioned message-passing algorithm, suited to temporal moment localization, that captures the relationships between humans, objects and activities in the video. These relationships are obtained by a spatial sub-graph that contextualizes the scene representation using detected object and human features conditioned on the language query. Moreover, a temporal sub-graph captures the activities within the video through time. Our method is evaluated on three standard benchmark datasets, and we also introduce YouCookII as a new benchmark for this task. Experiments show that our method outperforms state-of-the-art methods on these datasets, confirming the effectiveness of our approach.
ISSN:	2642-9381
DOI:	10.1109/WACV48630.2021.00112
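
The language-conditioned message passing mentioned in the summary can be illustrated with a small sketch. This is not the authors' implementation: the record does not give the update rule, so the element-wise gating by the query, the softmax attention over neighbours, and the residual update are all assumptions chosen for illustration, and the function name `language_conditioned_step` is hypothetical. Node features stand in for detected object and human features, and the query vector stands in for a sentence embedding of the same dimension.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a non-empty list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def language_conditioned_step(nodes, query):
    """One round of message passing over a fully connected graph.

    nodes: list of feature vectors (e.g. object or human features).
    query: sentence embedding with the same dimension (assumption).
    Returns a new list of updated node features.
    """
    # Condition every node on the query by element-wise gating
    # (an illustrative choice, not taken from the paper).
    gated = [[n * q for n, q in zip(node, query)] for node in nodes]

    updated = []
    for i, node in enumerate(nodes):
        others = [j for j in range(len(nodes)) if j != i]
        if not others:
            # A single node has no neighbours to receive messages from.
            updated.append(list(node))
            continue
        # Attention weights from query-gated similarity to each neighbour.
        weights = softmax([dot(gated[i], gated[j]) for j in others])
        # Weighted sum of neighbour features forms the incoming message.
        message = [
            sum(w * nodes[j][d] for w, j in zip(weights, others))
            for d in range(len(node))
        ]
        # Residual update: own feature plus aggregated message.
        updated.append([x + m for x, m in zip(node, message)])
    return updated


if __name__ == "__main__":
    objects = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy node features
    sentence = [1.0, 0.0]                            # toy query embedding
    for row in language_conditioned_step(objects, sentence):
        print(row)
```

In this sketch the query decides which feature dimensions count when nodes compare themselves with their neighbours, so the same scene yields different contextualized features for different sentences. A temporal sub-graph could reuse the same step with per-frame features as nodes, though that too is an assumption rather than something this record states.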