Collaborative three-stream transformers for video captioning

Bibliographic Details
Published in: Computer Vision and Image Understanding, Vol. 235, p. 103799
Main Authors: Wang, Hao; Zhang, Libo; Fan, Heng; Luo, Tiejian
Format: Journal Article
Language: English
Published: Elsevier Inc, 01.10.2023

Summary: As the most critical components in a sentence, the subject, predicate and object require special attention in the video captioning task. To implement this idea, we design a novel framework, named COllaborative three-Stream Transformers (COST), to model the three parts separately and complement each other for better representation. Specifically, COST is formed by three branches of transformers that exploit the visual-linguistic interactions of different granularities in the spatial–temporal domain between videos and text, detected objects and text, and actions and text. Meanwhile, we propose a cross-granularity attention module to align the interactions modeled by the three branches, so that the branches can support each other in exploiting the most discriminative semantic information of different granularities for accurate caption prediction. The whole model is trained in an end-to-end fashion. Extensive experiments conducted on three large-scale challenging datasets, i.e., YouCookII, ActivityNet Captions and MSVD, demonstrate that the proposed method performs favorably against the state-of-the-art methods.

Highlights:
• COST is proposed as a novel multi-branch framework for video captioning.
• The designed module aligns interactions across branches, leading to precise captioning.
• The proposed training objective enhances COST via constraints on the embeddings' semantics.
• Abundant experiments show our method performs favorably against the state-of-the-art.
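To make the described architecture more concrete, below is a minimal, hypothetical PyTorch sketch of three parallel transformer branches whose token sequences are aligned by a cross-granularity attention module before caption prediction. This is not the authors' released code: all class names, feature dimensions, the pairing of branches in the alignment step, and the simple mean-pool fusion are illustrative assumptions.

import torch
import torch.nn as nn


class CrossGranularityAttention(nn.Module):
    """Lets one branch's tokens attend to another branch's tokens
    (illustrative stand-in for the cross-granularity attention module)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_tokens, context_tokens):
        # Query tokens borrow semantics from a different granularity.
        aligned, _ = self.attn(query_tokens, context_tokens, context_tokens)
        return self.norm(query_tokens + aligned)


class ThreeStreamCaptioner(nn.Module):
    """Three parallel encoders (video-text, object-text, action-text)
    whose outputs are cross-aligned and fused for word prediction."""

    def __init__(self, dim: int = 512, num_layers: int = 2, vocab_size: int = 10000):
        super().__init__()

        def make_branch():
            layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=num_layers)

        self.video_branch = make_branch()
        self.object_branch = make_branch()
        self.action_branch = make_branch()
        # Hypothetical pairing: each branch is aligned with one neighbour.
        self.video_from_object = CrossGranularityAttention(dim)
        self.object_from_action = CrossGranularityAttention(dim)
        self.action_from_video = CrossGranularityAttention(dim)
        self.word_head = nn.Linear(3 * dim, vocab_size)

    def forward(self, video_tokens, object_tokens, action_tokens):
        v = self.video_branch(video_tokens)
        o = self.object_branch(object_tokens)
        a = self.action_branch(action_tokens)
        # Cross-granularity alignment between the three streams.
        v = self.video_from_object(v, o)
        o = self.object_from_action(o, a)
        a = self.action_from_video(a, v)
        # Pool and fuse the three streams for a word-level prediction.
        fused = torch.cat([v.mean(1), o.mean(1), a.mean(1)], dim=-1)
        return self.word_head(fused)


if __name__ == "__main__":
    model = ThreeStreamCaptioner()
    v = torch.randn(2, 16, 512)   # video-level features
    o = torch.randn(2, 10, 512)   # detected-object features
    a = torch.randn(2, 8, 512)    # action features
    print(model(v, o, a).shape)   # torch.Size([2, 10000])

In the actual paper the decoding and training objective are more involved (the highlights mention additional constraints on the embeddings' semantics); the sketch only illustrates the three-branch layout and the alignment step.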
ISSN: 1077-3142, 1090-235X
DOI: 10.1016/j.cviu.2023.103799