Towards reporting bias in visual-language datasets: bimodal augmentation by decoupling object-attribute association
Main Authors | , , , , , , |
---|---|
Format | Journal Article |
Language | English |
Published | 02.10.2023 |
Summary: | Reporting bias arises when people assume that some knowledge is universally
understood and therefore omit explicit elaboration. In this paper, we focus on
the pervasive reporting bias in visual-language datasets, embodied as the
object-attribute association, which can subsequently degrade models trained on
them. To mitigate this bias, we propose a bimodal augmentation (BiAug) approach
that decouples objects from attributes to flexibly synthesize visual-language
examples with a rich array of object-attribute pairings and to construct
cross-modal hard negatives. We employ large language models (LLMs) in
conjunction with a grounding object detector to extract target objects. The LLM
then generates a detailed attribute description for each object and produces a
corresponding hard-negative counterpart. An inpainting model subsequently
creates images from these detailed object descriptions. In this way, the
synthesized examples explicitly supply the omitted objects and attributes for
the model to learn, and the hard-negative pairs steer the model to distinguish
object attributes. Our experiments demonstrate that BiAug improves
object-attribute understanding and also boosts performance on zero-shot
retrieval tasks on general benchmarks such as MSCOCO and Flickr30K. BiAug
refines the way text-image datasets are collected: mitigating reporting bias
helps models achieve a deeper understanding of visual-language phenomena,
expanding beyond merely frequent patterns to encompass the richness and
diversity of real-world scenarios. |
---|---|
DOI: | 10.48550/arxiv.2310.01330 |
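The pipeline described in the summary (object extraction, attribute description with a hard-negative counterpart, and image synthesis) can be sketched roughly as below. This is a minimal illustration, not the authors' implementation: the detector, LLM, and inpainting calls are replaced by toy stub functions, and all names (`detect_objects`, `describe_attribute`, `inpaint`, `AugmentedExample`) are hypothetical placeholders.

```python
from dataclasses import dataclass

@dataclass
class AugmentedExample:
    object_name: str
    positive_caption: str   # caption with the omitted attribute made explicit
    negative_caption: str   # hard negative with a contrasting attribute

def detect_objects(caption: str) -> list[str]:
    """Stub for the LLM + grounding object detector: pick known nouns."""
    known_objects = {"cup", "dog", "car", "shirt"}
    return [w.strip(".,") for w in caption.lower().split()
            if w.strip(".,") in known_objects]

def describe_attribute(obj: str) -> tuple[str, str]:
    """Stub for the LLM attribute step: (attribute, hard-negative attribute)."""
    attribute_pairs = {"cup": ("red", "blue"), "dog": ("small", "large"),
                       "car": ("old", "new"), "shirt": ("striped", "plain")}
    return attribute_pairs.get(obj, ("plain", "colorful"))

def inpaint(obj: str, attribute: str) -> str:
    """Stub for the inpainting model: return a synthesized-image identifier."""
    return f"image_of_{attribute}_{obj}.png"

def biaug(caption: str) -> list[AugmentedExample]:
    """Decouple each object from an attribute and build a hard-negative pair."""
    examples = []
    for obj in detect_objects(caption):
        attr, neg_attr = describe_attribute(obj)
        inpaint(obj, attr)  # synthesize an image matching the explicit attribute
        examples.append(AugmentedExample(
            object_name=obj,
            positive_caption=f"a {attr} {obj}",
            negative_caption=f"a {neg_attr} {obj}",
        ))
    return examples
```

For instance, `biaug("A dog next to a cup.")` yields two examples, each pairing an explicit-attribute caption with a hard-negative caption that differs only in the attribute.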