CLIP-Driven Distinctive Interactive Transformer for Image Difference Captioning


Bibliographic Details
Published in 2023 5th International Conference on Frontiers Technology of Information and Computer (ICFTIC), pp. 1232-1236
Main Authors Hu, Jinhong; Zhang, Benqi; Chen, Ying
Format Conference Proceeding
Language English
Published IEEE 17.11.2023
Summary: Image difference captioning has attracted growing attention from industry and academia in recent years. Compared with traditional image captioning (IC), image difference captioning (IDC) is more challenging because it must both locate the differences between images and describe them. Previous research primarily concentrates on extracting subtle visual differences using convolution-based networks and generating distinctive captions through natural language models. Despite the significant advances achieved by these works, they have noticeably amplified the modal disparity between text and images, which affects the generation of difference descriptions. Taking advantage of the strengths of CLIP and Transformer, we propose a CLIP-Driven Distinctive Interactive Transformer (CLIP-DIT) for image difference captioning. Technically, our CLIP-DIT incorporates a buffer linear embedding to establish an effective connection between CLIP and Transformer. Additionally, recognizing the characteristics of IDC, we design a distinctive interactive attention to enhance the focus on visual differences. Extensive experiments are carried out on the CLEVR-Change dataset, and the experimental results show that our CLIP-DIT generates more accurate difference captions than seven advanced methods.
DOI: 10.1109/ICFTIC59930.2023.10455855