Collaborative three-stream transformers for video captioning

As the most critical components in a sentence, subject, predicate and object require special attention in the video captioning task. To implement this idea, we design a novel framework, named COllaborative three-Stream Transformers (COST), to model the three parts separately and complement each othe...

Bibliographic Details
Published in	Computer vision and image understanding Vol. 235; p. 103799
Main Authors	Wang, Hao; Zhang, Libo; Fan, Heng; Luo, Tiejian
Format	Journal Article
Language	English
Published	Elsevier Inc 01.10.2023
Subjects	Cross-granularity; Multi-modal; Spatial–temporal domain; Video captioning; 68T45
