A hierarchical recurrent approach to predict scene graphs from a visual‐attention‐oriented perspective

Bibliographic Details
Published in	Computational intelligence Vol. 35; no. 3; pp. 496-516
Main Authors	Gao, Wenjing; Zhu, Yonghua; Zhang, Wenjun; Zhang, Ke; Gao, Honghao
Format	Journal Article
Language	English
Published	Hoboken: Blackwell Publishing Ltd, 01.08.2019
Subjects	Algorithms; Construction; Dependence; hierarchical recurrent neural network; Image management; Image retrieval; Neural networks; Object recognition; Recurrent neural networks; relationship triplet recognition; scene graph; Triplets; visual attention mechanism; Visual tasks
Summary:	A scene graph provides a powerful intermediate knowledge structure for various visual tasks, including semantic image retrieval, image captioning, and visual question answering. In this paper, the task of predicting a scene graph for an image is formulated as two connected problems, i.e., recognizing the relationship triplets, structured as <subject‐predicate‐object>, and constructing the scene graph from the recognized relationship triplets. For relationship triplet recognition, we develop a novel hierarchical recurrent neural network with a visual attention mechanism. This model is composed of two attention‐based recurrent neural networks in a hierarchical organization. The first network generates a topic vector for each relationship triplet, whereas the second network predicts each word in that relationship triplet given the topic vector. This approach captures the compositional structure and contextual dependency of an image and the relationship triplets describing its scene. For scene graph construction, an entity localization approach that determines the graph structure with the assistance of the available attention information is presented. The procedure for automatically converting the generated relationship triplets into a scene graph is then specified as an algorithm. Extensive experimental results on two widely used data sets verify the feasibility of the proposed approach.
Bibliography:	Honghao Gao, Shanghai Film Academy, Shanghai University, Shanghai, China
ISSN:	0824-7935, 1467-8640
DOI:	10.1111/coin.12202
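
Illustration (not part of the record): the Summary describes converting recognized <subject‐predicate‐object> triplets into a scene graph, using entity localization to decide which mentions refer to the same entity. The paper's actual algorithm is not reproduced in this record, so the following is only a minimal sketch under stated assumptions: each entity mention is assumed to come with a bounding box (x1, y1, x2, y2), for example one derived from attention maps, and two mentions are merged when their labels match and their boxes overlap strongly. The names `iou`, `SceneGraph`, `build_scene_graph`, and the threshold value are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]
Mention = Tuple[str, Box]              # (entity label, localized box)
Triplet = Tuple[Mention, str, Mention]  # (subject, predicate, object)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


@dataclass
class SceneGraph:
    nodes: List[Mention] = field(default_factory=list)
    edges: Set[Tuple[int, str, int]] = field(default_factory=set)

    def node_id(self, label: str, box: Box, thresh: float) -> int:
        """Return the index of a matching node, creating one if none matches."""
        for i, (other_label, other_box) in enumerate(self.nodes):
            # Assumption: same label plus strong overlap means same entity.
            if other_label == label and iou(other_box, box) >= thresh:
                return i
        self.nodes.append((label, box))
        return len(self.nodes) - 1


def build_scene_graph(triplets: List[Triplet], thresh: float = 0.5) -> SceneGraph:
    """Merge co-located entity mentions into nodes and add predicate edges."""
    graph = SceneGraph()
    for (subj, subj_box), predicate, (obj, obj_box) in triplets:
        s = graph.node_id(subj, subj_box, thresh)
        o = graph.node_id(obj, obj_box, thresh)
        graph.edges.add((s, predicate, o))
    return graph
```

In this sketch, two triplets such as "man riding horse" and "man wearing hat" share a single "man" node only when the two "man" boxes overlap above the threshold; a "man" localized elsewhere in the image becomes a separate node. Storing edges in a set means a triplet generated more than once yields a single edge.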