Scene-Graph-Guided message passing network for dense captioning

Bibliographic Details
Published in: Pattern Recognition Letters, Vol. 145, pp. 187-193
Main Authors: Liu, An-An; Wang, Yanhui; Xu, Ning; Liu, Shan; Li, Xuanya
Format: Journal Article
Language: English
Published: Amsterdam, Elsevier B.V. (Elsevier Science Ltd), 01.05.2021
Summary:
•We propose to leverage rich visual concepts and structured knowledge for dense caption generation.
•We use the objective function of scene graph generation to propagate structured knowledge through the refining pipeline.
•Experimental results and qualitative experiments confirm the effectiveness of our model.
The dense captioning task aims to both localize and describe salient regions of an image in natural language. It can benefit from rich visual concepts, including objects and pair-wise relationships. However, owing to the combinatorial complexity of formulating <subject-predicate-object> triplets, very little work has integrated them into the dense captioning task. Inspired by the recent success of scene graph generation for object and relationship detection, we propose a scene-graph-guided message passing network for dense caption generation. We first exploit message passing between objects and their relationships with a feature-refining structure. Moreover, we formulate message passing as an inter-connected visual concept generation problem, in which the objective function of scene graph generation guides region feature learning. The scene graph guidance propagates the structured knowledge of the graph through a concept-region message passing mechanism (CR-MPM), which improves the regional feature representation. Finally, the refined regional features are encoded by an LSTM-based decoder to generate dense captions. Our model achieves competitive performance on Visual Genome compared against existing methods, and qualitative experiments further confirm its effectiveness on the dense captioning task.
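The abstract describes a two-stage pipeline: region features are first refined by passing messages from scene-graph concepts (objects and relationships), and the refined features are then fed to an LSTM decoder that emits one caption per region. Below is a minimal, illustrative PyTorch sketch of such a pipeline. The module names, feature dimensions, attention-style message weighting, and GRU-based update are assumptions made for illustration only, not the authors' released implementation, and the scene graph generation objective that guides feature learning in the paper is omitted.

```python
import torch
import torch.nn as nn

class ConceptRegionMessagePassing(nn.Module):
    """Illustrative sketch: refine region features with messages aggregated
    from concept (object/relationship) features. Hypothetical design."""
    def __init__(self, dim=512):
        super().__init__()
        self.attn = nn.Linear(dim * 2, 1)   # scores each concept-region pair
        self.update = nn.GRUCell(dim, dim)  # folds the message into the region feature

    def forward(self, region_feats, concept_feats):
        # region_feats: (R, D), concept_feats: (C, D)
        R, D = region_feats.shape
        C = concept_feats.shape[0]
        pairs = torch.cat([
            region_feats.unsqueeze(1).expand(R, C, D),
            concept_feats.unsqueeze(0).expand(R, C, D)], dim=-1)
        weights = torch.softmax(self.attn(pairs).squeeze(-1), dim=1)  # (R, C)
        messages = weights @ concept_feats                            # (R, D)
        return self.update(messages, region_feats)                    # refined (R, D)

class DenseCaptionDecoder(nn.Module):
    """LSTM decoder that turns one refined region feature into caption logits."""
    def __init__(self, vocab_size=1000, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTMCell(dim, dim)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, region_feat, tokens):
        # region_feat: (D,), tokens: (T,) ground-truth word ids (teacher forcing)
        h = region_feat.unsqueeze(0)         # initialize hidden state with the region
        c = torch.zeros_like(h)
        logits = []
        for t in tokens:
            h, c = self.lstm(self.embed(t).unsqueeze(0), (h, c))
            logits.append(self.out(h))
        return torch.cat(logits, dim=0)      # (T, vocab)

# Toy forward pass with random features
regions = torch.randn(4, 512)                # 4 region proposals
concepts = torch.randn(6, 512)               # 6 object/relationship concepts
refined = ConceptRegionMessagePassing()(regions, concepts)
caption_logits = DenseCaptionDecoder()(refined[0], torch.tensor([1, 2, 3]))
print(caption_logits.shape)                  # torch.Size([3, 1000])
```

In the paper's formulation, the refined region features are additionally supervised by the scene graph generation objective, so object and relationship labels shape the representation before captioning; the sketch above omits that loss term for brevity.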
ISSN: 0167-8655, 1872-7344
DOI: 10.1016/j.patrec.2021.01.024