LSTM-in-LSTM for generating long descriptions of images
Published in | Computational visual media (Beijing) Vol. 2; no. 4; pp. 379 - 388 |
---|---|
Main Authors | , , , , |
Format | Journal Article |
Language | English |
Published | Beijing: Tsinghua University Press, 01.12.2016 (Springer Nature B.V.) |
Summary: | In this paper, we propose an approach for generating rich fine-grained textual descriptions of images. In particular, we use an LSTM-in-LSTM (long short-term memory) architecture, which consists of an inner LSTM and an outer LSTM. The inner LSTM effectively encodes the long-range implicit contextual interaction between visual cues (i.e., the spatially concurrent visual objects), while the outer LSTM generally captures the explicit multi-modal relationship between sentences and images (i.e., the correspondence of sentences and images). This architecture is capable of producing a long description by predicting one word at every time step conditioned on the previously generated word, a hidden vector (via the outer LSTM), and a context vector of fine-grained visual cues (via the inner LSTM). Our model outperforms state-of-the-art methods on several benchmark datasets (Flickr8k, Flickr30k, MSCOCO) when used to generate long rich fine-grained descriptions of given images in terms of four different metrics (BLEU, CIDEr, ROUGE-L, and METEOR). |
---|---|
ISSN: | 2096-0433, 2096-0662 |
DOI: | 10.1007/s41095-016-0059-z |
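The decoding scheme described in the summary above can be illustrated with a short sketch: an inner LSTM compresses the visual-cue features into a context vector, and an outer LSTM cell predicts each word from the previous word, its own hidden vector, and that context vector. The code below is a minimal PyTorch sketch written for illustration only; the class name, layer sizes, concatenation-based fusion, and use of the inner LSTM's final hidden state as the context vector are assumptions, not the authors' reference implementation.

```python
# Schematic sketch (not the authors' code) of the LSTM-in-LSTM decoding step.
# All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class LSTMinLSTMDecoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, cue_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Inner LSTM: encodes the sequence of fine-grained visual cues
        # (e.g., features of spatially concurrent object regions) into a
        # context vector capturing their implicit interactions.
        self.inner_lstm = nn.LSTM(cue_dim, hidden_dim, batch_first=True)
        # Outer LSTM: models the explicit multi-modal relationship between
        # the sentence generated so far and the image.
        self.outer_cell = nn.LSTMCell(embed_dim + hidden_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, vocab_size)

    def forward(self, cue_feats, captions):
        # cue_feats: (batch, num_cues, cue_dim)  visual-cue features
        # captions:  (batch, seq_len)            word indices (teacher forcing)
        batch = cue_feats.size(0)
        # Context vector of visual cues: last hidden state of the inner LSTM.
        _, (ctx, _) = self.inner_lstm(cue_feats)
        ctx = ctx.squeeze(0)                      # (batch, hidden_dim)
        h = torch.zeros(batch, ctx.size(1))
        c = torch.zeros(batch, ctx.size(1))
        logits = []
        for t in range(captions.size(1)):
            # Each word is predicted from the previous word, the outer LSTM's
            # hidden vector, and the inner LSTM's context vector.
            w = self.embed(captions[:, t])
            h, c = self.outer_cell(torch.cat([w, ctx], dim=1), (h, c))
            logits.append(self.classifier(h))
        return torch.stack(logits, dim=1)         # (batch, seq_len, vocab_size)


# Usage sketch with random tensors standing in for real image/caption data.
if __name__ == "__main__":
    model = LSTMinLSTMDecoder(vocab_size=10000)
    cues = torch.randn(2, 5, 512)                 # 5 visual cues per image
    caps = torch.randint(0, 10000, (2, 12))       # 12-word captions
    print(model(cues, caps).shape)                # torch.Size([2, 12, 10000])
```

In this sketch the inner LSTM's final hidden state stands in for the paper's "context vector of fine-grained visual cues"; the paper's exact fusion and training details should be taken from the article itself (DOI above).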