CE-BART: Cause-and-Effect BART for Visual Commonsense Generation

Bibliographic Details
Published in	Sensors (Basel, Switzerland) Vol. 22; no. 23; p. 9399
Main Authors	Kim, Junyeong; Hong, Ji Woo; Yoon, Sunjae; Yoo, Chang D
Format	Journal Article
Language	English
Published	Switzerland, MDPI AG, 02.12.2022
Subjects	Ablation; AVSD; Benchmarks; Causality; Cognition & reasoning; Computational linguistics; Deep learning; Graph representations; Humans; Knowledge; Language; Language processing; Learning; Natural language interfaces; Natural language processing; Qualitative analysis; Semantics; Video-grounded dialogue; Visual commonsense generation; Visual effects; Visual tasks; VisualCOMET; Visual–language reasoning
Summary:	"A picture is worth a thousand words." Given an image, humans can infer cause-and-effect descriptions of past, current, and future events that lie beyond the image itself. Visual commonsense generation aims to generate three cause-and-effect captions for a given image: (1) what needed to happen before, (2) what the current intent is, and (3) what will happen after. This task remains challenging for machines because of two limitations in existing approaches: they (1) directly apply conventional vision-language transformers to learn relationships between input modalities and (2) treat each target caption independently, ignoring the relations among the cause-and-effect captions. Herein, we propose Cause-and-Effect BART (CE-BART), which is built on (1) a structured graph reasoner that captures intra- and inter-modality relationships among visual and textual representations and (2) a cause-and-effect generator that produces the captions by considering the causal relations among inferences. We validate CE-BART on the VisualCOMET and AVSD benchmarks. CE-BART achieved state-of-the-art (SOTA) performance on both benchmarks, and an extensive ablation study and qualitative analysis demonstrated the performance gain and improved interpretability.
ISSN:	1424-8220
DOI:	10.3390/s22239399
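
The task described in the summary has a simple input/output shape: one image in, three related captions out. The sketch below illustrates only that shape. It is not the CE-BART implementation: the summary does not say how the generator models causal relations, so the fixed before, intent, after ordering and the names `CommonsenseCaptions`, `Captioner`, `generate_jointly`, and `toy_captioner` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# The three inference types named in the summary.
INFERENCE_TYPES = ("before", "intent", "after")


@dataclass
class CommonsenseCaptions:
    before: str  # what needed to happen before
    intent: str  # what the current intent is
    after: str   # what will happen after


# Hypothetical captioner signature (an assumption, not from the paper):
# (image_id, inference_type, earlier captions) -> caption text
Captioner = Callable[[str, str, Dict[str, str]], str]


def generate_jointly(image_id: str, captioner: Captioner) -> CommonsenseCaptions:
    """Generate the three captions in a fixed order, giving each call
    the captions produced so far instead of treating them independently."""
    context: Dict[str, str] = {}
    for kind in INFERENCE_TYPES:
        # Pass a copy so the captioner cannot mutate the running context.
        context[kind] = captioner(image_id, kind, dict(context))
    return CommonsenseCaptions(**context)


def toy_captioner(image_id: str, kind: str, earlier: Dict[str, str]) -> str:
    # Placeholder model: it only reports which earlier captions it could see.
    seen = ",".join(earlier) or "none"
    return f"{kind} caption for {image_id} (conditioned on: {seen})"


captions = generate_jointly("img_001", toy_captioner)
print(captions.before)
print(captions.intent)
print(captions.after)
```

A real system would replace `toy_captioner` with a trained vision-language model. The point of the sketch is only the contrast with independent generation, where each call would receive an empty context.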