Learning to Collocate Neural Modules for Image Captioning


Bibliographic Details
Published in: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 4249-4259
Main Authors: Yang, Xu; Zhang, Hanwang; Cai, Jianfei
Format: Conference Proceeding
Language: English
Published: IEEE, 01.10.2019

Summary: We do not speak word by word from scratch; our brain quickly structures a pattern like STH DO STH AT SOMEPLACE and then fills in the detailed descriptions. To endow existing encoder-decoder image captioners with such humanlike reasoning, we propose a novel framework, learning to Collocate Neural Modules (CNM), which generates the "inner pattern" connecting the visual encoder and the language decoder. Unlike the widely used neural module networks in visual Q&A, where the language (i.e., the question) is fully observable, CNM for captioning is more challenging because the language is being generated and is thus only partially observable. To this end, we make the following technical contributions for CNM training: 1) a compact module design: one module for function words and three for visual content words (i.e., nouns, adjectives, and verbs); 2) soft module fusion and multi-step module execution, which make the visual reasoning robust under partial observation; 3) a linguistic loss that keeps the module controller faithful to part-of-speech collocations (e.g., an adjective precedes a noun). Extensive experiments on the challenging MS-COCO image captioning benchmark validate the effectiveness of our CNM image captioner. In particular, CNM achieves a new state-of-the-art 127.9 CIDEr-D on the Karpathy split and a single-model 126.0 c40 on the official server. CNM is also robust to scarce training samples; e.g., when trained with only one sentence per image, CNM halves the performance loss relative to a strong baseline.
ISSN: 2380-7504
DOI: 10.1109/ICCV.2019.00435
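
The summary describes a controller that softly fuses four neural modules (one for function words, three for visual content words) at each decoding step, conditioned on the partially generated caption. The sketch below is a minimal, hypothetical PyTorch illustration of that soft module fusion idea; the class, attribute, and parameter names (SoftModuleFusion, word_modules, controller, feat_dim, hidden_dim) are assumptions made for illustration, not the authors' implementation.

import torch
import torch.nn as nn

class SoftModuleFusion(nn.Module):
    """Hypothetical sketch of CNM-style soft module fusion (not the authors' code)."""

    def __init__(self, feat_dim: int, hidden_dim: int, num_modules: int = 4):
        super().__init__()
        # Four per-word-class modules: one for function words and three for
        # visual content words (noun / adjective / verb), per the summary.
        self.word_modules = nn.ModuleList(
            [nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.ReLU())
             for _ in range(num_modules)]
        )
        # Controller: predicts soft collocation weights from the decoder's
        # current (partially observed) language state.
        self.controller = nn.Linear(hidden_dim, num_modules)

    def forward(self, visual_feats: torch.Tensor, lang_state: torch.Tensor):
        # visual_feats: (batch, feat_dim); lang_state: (batch, hidden_dim)
        weights = torch.softmax(self.controller(lang_state), dim=-1)  # (batch, num_modules)
        outputs = torch.stack(
            [m(visual_feats) for m in self.word_modules], dim=1
        )  # (batch, num_modules, hidden_dim)
        # Soft fusion: blend module outputs instead of hard selection, which
        # is what keeps the reasoning robust while the sentence is unfinished.
        fused = (weights.unsqueeze(-1) * outputs).sum(dim=1)  # (batch, hidden_dim)
        # A linguistic (part-of-speech) loss could supervise `weights` so the
        # controller follows POS collocations such as adjective-before-noun.
        return fused, weights

In such a sketch, fused would replace the raw visual feature fed to the language decoder at each decoding step, and the linguistic loss mentioned in the summary could supervise weights so the predicted module collocation follows part-of-speech patterns.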