Learning the Visual Interpretation of Sentences

Sentences that describe visual scenes contain a wide variety of information pertaining to the presence of objects, their attributes and their spatial relations. In this paper we learn the visual features that correspond to semantic phrases derived from sentences. Specifically, we extract predicate t...

Full description

Saved in:
Bibliographic Details
Published in2013 IEEE International Conference on Computer Vision pp. 1681 - 1688
Main Authors Zitnick, C. Lawrence, Parikh, Devi, Vanderwende, Lucy
Format Conference Proceeding Journal Article
LanguageEnglish
Published IEEE 01.12.2013
Subjects
Online AccessGet full text

Cover

Loading…
More Information
Summary:Sentences that describe visual scenes contain a wide variety of information pertaining to the presence of objects, their attributes and their spatial relations. In this paper we learn the visual features that correspond to semantic phrases derived from sentences. Specifically, we extract predicate tuples that contain two nouns and a relation. The relation may take several forms, such as a verb, preposition, adjective or their combination. We model a scene using a Conditional Random Field (CRF) formulation where each node corresponds to an object, and the edges to their relations. We determine the potentials of the CRF using the tuples extracted from the sentences. We generate novel scenes depicting the sentences' visual meaning by sampling from the CRF. The CRF is also used to score a set of scenes for a text-based image retrieval task. Our results show we can generate (retrieve) scenes that convey the desired semantic meaning, even when scenes (queries) are described by multiple sentences. Significant improvement is found over several baseline approaches.
Bibliography:ObjectType-Article-2
SourceType-Scholarly Journals-1
ObjectType-Conference-1
ObjectType-Feature-3
content type line 23
SourceType-Conference Papers & Proceedings-2
ISSN:1550-5499
DOI:10.1109/ICCV.2013.211