Fully Convolutional CaptionNet: Siamese Difference Captioning Attention Model

Bibliographic Details
Published in: IEEE Access, Vol. 7, pp. 175929-175939
Main Authors: Oluwasanmi, Ariyo; Frimpong, Enoch; Aftab, Muhammad Umar; Baagyere, Edward Y.; Qin, Zhiguang; Ullah, Kifayat
Format: Journal Article
Language: English
Published: Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2019

Summary: The generation of a textual description of the differences between images is a relatively new task that requires the fusion of computer vision and natural language techniques. In this paper, we present a novel Fully Convolutional CaptionNet (FCC) that employs an encoder-decoder framework to extract visual features, compute the feature distances, and generate new sentences describing the measured differences. After extracting the features of the two images, a contrastive function computes their weighted L1 distance, which is learned and selectively attended to determine the salient sections of the features at every time step. The attended feature region is matched to corresponding words iteratively until a sentence is completed. We further propose applying an upsampling network to enlarge the features' field of view, which provides a robust pixel-based discrepancy computation. Our extensive experiments indicate that the FCC model outperforms other learning models on the benchmark Spot-the-Diff dataset by generating succinct and meaningful textual descriptions of image differences.
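As a rough illustration of the mechanism described in the abstract, the fragment below computes a learned, weighted L1 difference between two Siamese CNN feature maps, upsamples the features to enlarge their field of view, and applies soft attention over the difference cells for a single decoding step. This is a minimal sketch assuming PyTorch, not the authors' implementation; all module, parameter, and variable names (DifferenceAttention, weighted_l1_difference, the toy encoder, etc.) are hypothetical.

# Minimal sketch (not the authors' code) of a weighted L1 feature difference
# with soft attention, as outlined in the abstract. Names are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DifferenceAttention(nn.Module):
    """Soft attention over the spatial cells of a difference feature map."""
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hid_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, diff_feats, hidden):
        # diff_feats: (B, N, feat_dim) flattened spatial cells of the difference map
        # hidden:     (B, hidden_dim) current decoder state at this time step
        e = self.score(torch.tanh(self.feat_proj(diff_feats)
                                  + self.hid_proj(hidden).unsqueeze(1)))  # (B, N, 1)
        alpha = F.softmax(e, dim=1)                # attention weights over cells
        context = (alpha * diff_feats).sum(dim=1)  # (B, feat_dim) attended feature
        return context, alpha

def weighted_l1_difference(feat_a, feat_b, weight):
    """Element-wise weighted L1 distance between two CNN feature maps.
    feat_a, feat_b: (B, C, H, W); weight: learnable per-channel scaling (C, 1, 1)."""
    return weight * torch.abs(feat_a - feat_b)

# Toy usage: encode both images with a shared (Siamese) CNN, upsample the
# features to enlarge the field of view, then compute and attend the difference.
encoder = nn.Sequential(nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU())
weight = nn.Parameter(torch.ones(64, 1, 1))
img_before, img_after = torch.randn(2, 1, 3, 128, 128)
fa, fb = encoder(img_before), encoder(img_after)
fa = F.interpolate(fa, scale_factor=2, mode='bilinear', align_corners=False)
fb = F.interpolate(fb, scale_factor=2, mode='bilinear', align_corners=False)
diff = weighted_l1_difference(fa, fb, weight)      # (1, 64, 128, 128)
diff_cells = diff.flatten(2).transpose(1, 2)       # (1, H*W, 64)
attn = DifferenceAttention(feat_dim=64, hidden_dim=512, attn_dim=256)
context, alpha = attn(diff_cells, torch.zeros(1, 512))
print(context.shape, alpha.shape)

In the full model described by the abstract, an attended context of this kind would presumably be fed to the sentence decoder at every time step until the description is complete.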
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2019.2957513