Tell and guess: cooperative learning for natural image caption generation with hierarchical refined attention

Bibliographic Details
Published in: Multimedia Tools and Applications, Vol. 80, No. 11, pp. 16267-16282
Main Authors: Zhang, Wenqiao; Tang, Siliang; Su, Jiajie; Xiao, Jun; Zhuang, Yueting
Format: Journal Article
Language: English
Published: New York: Springer US, 01.05.2021 (Springer Nature B.V.)
Summary: Automatically generating a natural language description of an image is one of the most fundamental and challenging problems in Multimedia Intelligence, because it translates information between two different modalities, and such translation requires the ability to understand both. Existing image captioning models have already achieved remarkable performance; however, they heavily rely on the Encoder-Decoder framework, a unidirectional translation that is hard to improve further. In this paper, we designed the “Tell and Guess” Cooperative Learning model with a Hierarchical Refined Attention mechanism (CL-HRA), which bidirectionally improves performance and generates more informative captions. The Cooperative Learning (CL) method combines an image caption module (ICM) with an image retrieval module (IRM): the ICM is responsible for the “Tell” function, generating informative natural-language descriptions for a given image, while the IRM “Guesses” by trying to select that image from a lineup of images based on the given description. Such cooperation mutually improves the learning of the two modules. In addition, the Hierarchical Refined Attention (HRA) learns to selectively attend to high-level attributes and low-level visual features and incorporates them into CL to bridge the gap from image to caption. The HRA pays different attention at different semantic levels to refine the visual representation, while the CL, with its human-like mindset, is more interpretable and generates captions more closely related to the corresponding image. Experimental results on the Microsoft COCO dataset show the effectiveness of CL-HRA in terms of several popular image caption generation metrics.
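The abstract above only outlines the “Tell and Guess” idea. As a rough illustration, and not the authors' CL-HRA implementation, the following is a minimal PyTorch-style sketch of jointly training a captioning module (the “Tell”) and a retrieval module (the “Guess”); all module names, dimensions, and the cross-entropy retrieval loss are assumptions made for this sketch.

```python
# Hypothetical sketch of a "Tell and Guess" cooperative objective:
# an image-caption module (ICM) "tells" a caption, and an image-retrieval
# module (IRM) "guesses" which image in a lineup a caption refers to.
# All names, dimensions, and losses are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ImageCaptionModule(nn.Module):
    """ICM ("Tell"): decodes a caption from pooled visual features."""

    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.visual_proj = nn.Linear(feat_dim, hidden_dim)
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.decoder = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, captions):
        # feats: (B, feat_dim) pooled image features; captions: (B, T) token ids
        h0 = torch.tanh(self.visual_proj(feats)).unsqueeze(0)   # (1, B, H)
        c0 = torch.zeros_like(h0)
        emb = self.embed(captions[:, :-1])                      # teacher forcing
        hidden, _ = self.decoder(emb, (h0, c0))
        logits = self.out(hidden)                               # (B, T-1, V)
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               captions[:, 1:].reshape(-1))


class ImageRetrievalModule(nn.Module):
    """IRM ("Guess"): scores each image in the batch lineup against a caption."""

    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, hidden_dim)
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.txt_enc = nn.GRU(hidden_dim, hidden_dim, batch_first=True)

    def forward(self, feats, captions):
        img = F.normalize(self.img_proj(feats), dim=-1)         # (B, H)
        _, txt = self.txt_enc(self.embed(captions))
        txt = F.normalize(txt.squeeze(0), dim=-1)               # (B, H)
        sims = txt @ img.t()                                    # caption i vs image j
        targets = torch.arange(feats.size(0))                   # correct image = own row
        return F.cross_entropy(sims, targets)


def cooperative_step(icm, irm, feats, captions, optimizer, alpha=0.5):
    """One joint update over both modules. In this sketch both losses use
    ground-truth captions; in the paper the cooperative signal comes from the
    IRM trying to pick the right image given the ICM's generated caption."""
    optimizer.zero_grad()
    loss = icm(feats, captions) + alpha * irm(feats, captions)
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    icm, irm = ImageCaptionModule(), ImageRetrievalModule()
    opt = torch.optim.Adam(list(icm.parameters()) + list(irm.parameters()), lr=1e-4)
    feats = torch.randn(8, 2048)                 # placeholder pooled CNN features
    caps = torch.randint(0, 10000, (8, 12))      # placeholder tokenized captions
    print(cooperative_step(icm, irm, feats, caps, opt))
```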
ISSN: 1380-7501; 1573-7721
DOI: 10.1007/s11042-020-08832-7