Mitigating Dataset Bias in Image Captioning Through CLIP Confounder-Free Captioning Network

Bibliographic Details
Published in: 2023 IEEE International Conference on Image Processing (ICIP), pp. 1720-1724
Main Authors: Kim, Yeonju; Kim, Junho; Lee, Byung-Kwan; Shin, Sebin; Ro, Yong Man
Format: Conference Proceeding
Language: English
Published: IEEE, 08.10.2023

More Information
Summary: Dataset bias has been identified as a major challenge in image captioning. When an image captioning model predicts a word, it should rely on the visual evidence associated with that word; instead, the model tends to rely on contextual evidence drawn from dataset bias, producing biased captions, especially when the dataset is skewed toward specific situations. To address this problem, we approach it from a causal inference perspective and design a causal graph. Based on this causal graph, we propose a novel method named C²Cap, a CLIP confounder-free captioning network. We use a global visual confounder to control the confounding factors in the image and train the model to produce debiased captions. We validate the proposed method on the MSCOCO benchmark and demonstrate its effectiveness. https://github.com/yeonju7kim/C2Cap
DOI: 10.1109/ICIP49359.2023.10222502
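
The summary above describes controlling a global visual confounder built from CLIP features so the decoder attends to image evidence rather than dataset context. The following is a minimal, hypothetical sketch of one common way such confounder control is realized (not the authors' released code; see their repository for the actual method). The layer name, the dimensions, and the construction of the confounder dictionary as k-means centroids of CLIP global image embeddings are all assumptions for illustration.

# Hypothetical sketch: a decoder layer that attends over a frozen
# "global visual confounder" dictionary alongside per-image features,
# approximating the expectation over confounders z used in
# backdoor-style adjustment. Not the authors' implementation.
import torch
import torch.nn as nn

class ConfounderAdjustedDecoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, num_confounders=100):
        super().__init__()
        # Frozen confounder dictionary z, e.g. k-means centroids of CLIP
        # global image embeddings over the training set (assumption);
        # random here only so the sketch runs stand-alone.
        self.register_buffer(
            "confounders", torch.randn(num_confounders, d_model))
        self.vis_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.conf_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.fuse = nn.Linear(2 * d_model, d_model)

    def forward(self, h, vis):
        # h: (B, T, d) decoder hidden states; vis: (B, N, d) image features.
        v, _ = self.vis_attn(h, vis, vis)  # word-specific visual evidence
        z = self.confounders.unsqueeze(0).expand(h.size(0), -1, -1)
        # Soft attention over all confounders approximates the outer
        # expectation over z required by backdoor adjustment.
        c, _ = self.conf_attn(h, z, z)
        return self.fuse(torch.cat([v, c], dim=-1))

# Usage sketch:
layer = ConfounderAdjustedDecoderLayer()
h = torch.randn(2, 10, 512)    # decoder states for 2 captions of length 10
vis = torch.randn(2, 49, 512)  # e.g. a 7x7 grid of visual features per image
out = layer(h, vis)            # (2, 10, 512)

The design point is that the confounder dictionary is shared across all images and frozen, so the attention over it exposes the dataset-level context explicitly; the fusion layer can then learn to separate that context from the per-image visual evidence instead of letting it leak into the word predictions.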