Overview of Natural Language Video Localization

Bibliographic Details
Published in	Ji suan ji ke xue Vol. 49; no. 9; pp. 111 - 122
Main Authors	Nie, Xiu-shan, Pan, Jia-nan, Tan, Zhi-fang, Liu, Xin-fang, Guo, Jie, Yin, Yi-long
Format	Journal Article
Language	Chinese
Published	Chongqing Guojia Kexue Jishu Bu 01.09.2022 Editorial office of Computer Science
Subjects	multimodal retrieval; video moment localization; video comprehension; cross-modal alignment; cross-modal interaction
Summary:	Natural language video localization (NLVL), which aims to locate the moment in a video that semantically corresponds to a text query, is a novel and challenging task. Unlike temporal action localization, NLVL is more flexible because it is not restricted to predefined action categories. At the same time, it is more challenging because it requires aligning semantic information across the visual and textual modalities. Deriving the final timestamps from that alignment is also difficult. This paper first presents the general pipeline of NLVL, then divides existing methods into supervised and weakly supervised approaches according to whether supervision is available, and analyzes the strengths and weaknesses of each kind. It then presents the datasets, evaluation protocols, and a general performance analysis. Finally, it summarizes the existing methods to identify possible future directions.
ISSN:	1002-137X
DOI:	10.11896/jsjkx.220500130