Fully Convolutional CaptionNet: Siamese Difference Captioning Attention Model

Bibliographic Details
Published in: IEEE Access, Vol. 7, pp. 175929-175939
Main Authors: Oluwasanmi, Ariyo; Frimpong, Enoch; Aftab, Muhammad Umar; Baagyere, Edward Y.; Qin, Zhiguang; Ullah, Kifayat
Format: Journal Article
Language: English
Published: Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2019

Summary: The generation of a textual description of the differences between images is a relatively new task that requires the fusion of computer vision and natural language techniques. In this paper, we present a novel Fully Convolutional CaptionNet (FCC) that employs an encoder-decoder framework to extract visual features, compute the feature distances, and generate new sentences describing the measured differences. After extracting the features of the two images, a contrastive function computes their weighted L1 distance, which is learned and selectively attended to determine the salient sections of the features at every time step. The attended feature region is matched to corresponding words iteratively until a sentence is completed. We further propose applying an upsampling network to enlarge the features' field of view, which provides a robust pixel-based discrepancy computation. Our extensive experiments indicate that the FCC model outperforms other learning models on the benchmark Spot-the-Diff dataset by generating succinct and meaningful textual descriptions of image differences.
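As a rough illustration of the mechanism described in the abstract, the fragment below computes a learned, weighted L1 difference between two Siamese CNN feature maps, upsamples the features to enlarge their field of view, and applies soft attention over the difference cells for a single decoding step. This is a minimal sketch assuming PyTorch, not the authors' implementation; all module, parameter, and variable names (DifferenceAttention, weighted_l1_difference, the toy encoder, etc.) are hypothetical.

# Minimal sketch (not the authors' code) of a weighted L1 feature difference
# with soft attention, as outlined in the abstract. Names are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DifferenceAttention(nn.Module):
    """Soft attention over the spatial cells of a difference feature map."""
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hid_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, diff_feats, hidden):
        # diff_feats: (B, N, feat_dim) flattened spatial cells of the difference map
        # hidden:     (B, hidden_dim) current decoder state at this time step
        e = self.score(torch.tanh(self.feat_proj(diff_feats)
                                  + self.hid_proj(hidden).unsqueeze(1)))  # (B, N, 1)
        alpha = F.softmax(e, dim=1)                # attention weights over cells
        context = (alpha * diff_feats).sum(dim=1)  # (B, feat_dim) attended feature
        return context, alpha

def weighted_l1_difference(feat_a, feat_b, weight):
    """Element-wise weighted L1 distance between two CNN feature maps.
    feat_a, feat_b: (B, C, H, W); weight: learnable per-channel scaling (C, 1, 1)."""
    return weight * torch.abs(feat_a - feat_b)

# Toy usage: encode both images with a shared (Siamese) CNN, upsample the
# features to enlarge the field of view, then compute and attend the difference.
encoder = nn.Sequential(nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU())
weight = nn.Parameter(torch.ones(64, 1, 1))
img_before, img_after = torch.randn(2, 1, 3, 128, 128)
fa, fb = encoder(img_before), encoder(img_after)
fa = F.interpolate(fa, scale_factor=2, mode='bilinear', align_corners=False)
fb = F.interpolate(fb, scale_factor=2, mode='bilinear', align_corners=False)
diff = weighted_l1_difference(fa, fb, weight)      # (1, 64, 128, 128)
diff_cells = diff.flatten(2).transpose(1, 2)       # (1, H*W, 64)
attn = DifferenceAttention(feat_dim=64, hidden_dim=512, attn_dim=256)
context, alpha = attn(diff_cells, torch.zeros(1, 512))
print(context.shape, alpha.shape)

In the full model described by the abstract, an attended context of this kind would presumably be fed to the sentence decoder at every time step until the description is complete.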
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2019.2957513