Video question answering via grounded cross-attention network learning
Published in: Information Processing & Management, Vol. 57, No. 4, p. 102265
Format: Journal Article
Language: English
Published: Oxford: Elsevier Ltd, 01.07.2020
Summary:
• We study the problem of video question answering from the viewpoint of modeling both a rough video representation and a grounded video representation. A joint question-video representation based on the rough and grounded representations of the video is learned for answer prediction.
• We propose the grounded cross-attention network learning framework (GCANet), a novel hierarchical cross-attention method with a Q−O cross-attention layer and a Q−V−H cross-attention layer. GCANet adopts a novel mutual attention learning mechanism.
• We construct two large-scale datasets for video question answering. Extensive experiments validate the effectiveness of our method.
Video question answering is a burgeoning and challenging task in visual information retrieval (VIR): it automatically generates the answer to a question based on referenced video content. Unlike existing visual question answering methods, which mainly focus on static image content, video question answering must also take the temporal dimension into account because of the essential structural difference between images and video. In this paper, we study the problem of video question answering from the viewpoint of grounded cross-attention network learning. Specifically, we propose a novel hierarchical cross-attention mechanism of mutual attention learning for video question answering, named GCANet. We first obtain a multi-level rough video representation from frame-level and clip-level video features. Then, we use a region proposal network to generate object-level grounded video features as the grounded video representation. Next, the grounded question-video representation is learned by the first layer of the GCANet framework, the Q−O cross-attention layer. The second layer, the Q−V−H cross-attention layer, learns the joint question-video representation from both the rough and grounded representations of the video for answer prediction. We construct two large-scale video question answering datasets, and experimental results on them demonstrate the effectiveness of our model.
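To make the cross-attention idea in the abstract concrete, the following is a minimal NumPy sketch of one direction of such a layer: each question-token vector attends over a set of object-level features (as would come from a region proposal network) and returns an attended summary. All names, dimensions, and the scaled dot-product scoring choice are illustrative assumptions, not the paper's actual GCANet implementation; the paper's mutual attention mechanism would additionally run the symmetric object-to-question direction.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_feats, context_feats):
    """One direction of a cross-attention step.

    query_feats:   (m, d) e.g. question-token features
    context_feats: (n, d) e.g. object-level grounded features
    Returns an (m, d) attended summary of the context per query vector.
    """
    d = query_feats.shape[1]
    scores = query_feats @ context_feats.T / np.sqrt(d)  # (m, n) similarity scores
    weights = softmax(scores, axis=-1)                   # attention over context items
    return weights @ context_feats                       # (m, d) attended context

# Hypothetical sizes: 5 question tokens, 7 object proposals, feature dim 16.
rng = np.random.default_rng(0)
q = rng.standard_normal((5, 16))   # question token features (assumed)
o = rng.standard_normal((7, 16))   # object features from an RPN (assumed)
attended = cross_attention(q, o)   # question-grounded summary per token, shape (5, 16)
```

Stacking two such layers, one over object features (Q−O) and one over the rough video representation together with the grounded output (Q−V−H), would give the hierarchical structure the abstract describes.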
ISSN: 0306-4573; 1873-5371
DOI: 10.1016/j.ipm.2020.102265