Syntax-Aware Action Targeting for Video Captioning

Bibliographic Details
Published in: Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online), pp. 13093-13102
Main Authors: Zheng, Qi; Wang, Chaoyue; Tao, Dacheng
Format: Conference Proceeding
Language: English
Published: IEEE, 01.06.2020

More Information
Summary: Existing methods on video captioning have made great efforts to identify objects/instances in videos, but few of them emphasize the prediction of action. As a result, the learned models are likely to depend heavily on the prior of training data, such as the co-occurrence of objects, which may cause an enormous divergence between the generated descriptions and the video content. In this paper, we explicitly emphasize the importance of action by predicting visually-related syntax components including subject, object and predicate. Specifically, we propose a Syntax-Aware Action Targeting (SAAT) module that first builds a self-attended scene representation to draw global dependence among multiple objects within a scene, and then decodes the visually-related syntax components by setting different queries. After targeting the action, indicated by the predicate, our captioner learns an attention distribution over the predicate and the previously predicted words to guide the generation of the next word. Comprehensive experiments on MSVD and MSR-VTT datasets demonstrate the efficacy of the proposed model.
ISSN: 1063-6919
DOI: 10.1109/CVPR42600.2020.01311
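
The summary above outlines a two-stage design: self-attention over detected objects to form a scene representation, then learnable queries that decode the subject, predicate and object from it. The PyTorch sketch below illustrates that general idea only; the module name SyntaxTargeting, the layer sizes, the number of attention heads and the single linear word head are illustrative assumptions, not the authors' implementation.

# Minimal sketch of a syntax-targeting module in the spirit of the summary above.
# All names and hyperparameters are assumptions made for illustration.
import torch
import torch.nn as nn

class SyntaxTargeting(nn.Module):
    def __init__(self, feat_dim=512, vocab_size=10000):
        super().__init__()
        # Self-attention draws global dependence among the detected objects in a scene.
        self.self_attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        # One learnable query per syntax component: subject, predicate, object.
        self.queries = nn.Parameter(torch.randn(3, feat_dim))
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        # Project each attended component to a distribution over the word vocabulary.
        self.word_head = nn.Linear(feat_dim, vocab_size)

    def forward(self, obj_feats):
        # obj_feats: (batch, num_objects, feat_dim) object/region features per video.
        scene, _ = self.self_attn(obj_feats, obj_feats, obj_feats)  # self-attended scene
        q = self.queries.unsqueeze(0).expand(obj_feats.size(0), -1, -1)
        comp, _ = self.cross_attn(q, scene, scene)                  # (batch, 3, feat_dim)
        logits = self.word_head(comp)                               # words for subj/pred/obj
        return comp, logits

# Usage: 8 videos, 10 detected objects each.
model = SyntaxTargeting()
feats = torch.randn(8, 10, 512)
components, word_logits = model(feats)
print(components.shape, word_logits.shape)  # (8, 3, 512) (8, 3, 10000)

In the paper's pipeline the predicate component would additionally guide the caption decoder's attention at each step; that part is omitted here for brevity.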