CLIP-Driven Distinctive Interactive Transformer for Image Difference Captioning

Bibliographic Details
Published in	2023 5th International Conference on Frontiers Technology of Information and Computer (ICFTIC), pp. 1232-1236
Main Authors	Hu, Jinhong; Zhang, Benqi; Chen, Ying
Format	Conference Proceeding
Language	English
Published	IEEE 17.11.2023
Subjects	CLIP; Image Difference Captioning; Transformer

Summary:	Image difference captioning has attracted growing attention from industry and academia in recent years. Compared with traditional image captioning (IC), image difference captioning (IDC) is more challenging because it must both locate the differences and describe them. Previous research primarily concentrates on extracting subtle visual differences using convolution-based networks and generating distinctive captions through natural language models. Despite the significant advances achieved by these works, they have noticeably widened the modality gap between text and images, which impairs the generation of difference descriptions. Taking advantage of the strengths of CLIP and Transformer, we propose a CLIP-Driven Distinctive Interactive Transformer (CLIP-DIT) for image difference captioning. Technically, our CLIP-DIT incorporates a buffer linear embedding to establish an effective connection between CLIP and Transformer. Additionally, recognizing the characteristics of IDC, we design a distinctive interactive attention to enhance the focus on visual differences. Extensive experiments are carried out on the CLEVR-Change dataset, and the experimental results show that our CLIP-DIT generates more accurate difference captions than seven advanced methods.
DOI:	10.1109/ICFTIC59930.2023.10455855