CLIP-Driven Distinctive Interactive Transformer for Image Difference Captioning
Published in | 2023 5th International Conference on Frontiers Technology of Information and Computer (ICFTIC), pp. 1232 - 1236 |
---|---|
Main Authors | |
Format | Conference Proceeding |
Language | English |
Published | IEEE, 17.11.2023 |
Summary: | Image difference captioning has attracted growing attention from industry and academia in recent years. Compared with traditional image captioning (IC), image difference captioning (IDC) is more challenging because it must both locate the differences and describe them. Previous research primarily concentrates on extracting subtle visual differences with convolution-based networks and generating distinctive captions with natural language models. Despite the significant advancements achieved by these works, they noticeably amplify the modal disparity between text and images, which hampers the generation of difference descriptions. Taking advantage of the strengths of CLIP and the Transformer, we propose a CLIP-Driven Distinctive Interactive Transformer (CLIP-DIT) for image difference captioning. Technically, CLIP-DIT incorporates a buffer linear embedding to establish an effective connection between CLIP and the Transformer. Additionally, recognizing the characteristics of IDC, we design a distinctive interactive attention mechanism to sharpen the focus on visual differences. Extensive experiments on the CLEVR-Change dataset show that CLIP-DIT generates more accurate difference captions than seven advanced methods. |
---|---|
DOI: | 10.1109/ICFTIC59930.2023.10455855 |
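The abstract describes two components: a buffer linear embedding that projects CLIP visual features into the Transformer's embedding space, and a distinctive interactive attention that emphasizes what changed between the two images. The paper's exact formulation is not given in this record; the following is a minimal NumPy sketch under assumed dimensions (512-d CLIP patch features, 256-d model width, 49 tokens per image) and an assumed attention design (cross-attention between the two images followed by a residual difference), purely to illustrate the data flow.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions (not specified in the record): CLIP feature dim 512,
# transformer model dim 256, 49 patch tokens per image.
CLIP_DIM, MODEL_DIM, N_TOKENS = 512, 256, 49

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def buffer_linear_embedding(feats, W, b):
    """Project frozen CLIP features into the transformer embedding space."""
    return feats @ W + b

def distinctive_interactive_attention(before, after):
    """Hypothetical sketch: 'before' tokens cross-attend over 'after' tokens
    (the interactive part); the residual between a token and what it retrieves
    from the other image highlights changed regions (the distinctive part)."""
    scale = np.sqrt(before.shape[-1])
    attn = softmax(before @ after.T / scale)   # (N_TOKENS, N_TOKENS)
    interacted = attn @ after                  # what 'after' explains
    return before - interacted                 # unexplained = difference cue

# Toy CLIP patch features for the "before" and "after" images.
before = rng.standard_normal((N_TOKENS, CLIP_DIM))
after = rng.standard_normal((N_TOKENS, CLIP_DIM))

# Buffer embedding parameters (randomly initialized here).
W = rng.standard_normal((CLIP_DIM, MODEL_DIM)) / np.sqrt(CLIP_DIM)
b = np.zeros(MODEL_DIM)

before_emb = buffer_linear_embedding(before, W, b)
after_emb = buffer_linear_embedding(after, W, b)
diff_tokens = distinctive_interactive_attention(before_emb, after_emb)
print(diff_tokens.shape)  # difference-aware tokens fed to the caption decoder
```

In a real model the projection would be learned jointly with the Transformer decoder, and the difference-aware tokens would serve as the cross-attention memory for caption generation; this sketch only shows the tensor shapes and the before/after interaction.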