PGCL: Prompt guidance and self-supervised contrastive learning-based method for Visual Question Answering

Bibliographic Details
Published in: Expert Systems with Applications, Vol. 251, p. 124011
Authors: Gao, Ling; Zhang, Hongda; Liu, Yiming; Sheng, Nan; Feng, Haotian; Xu, Hao
Format: Journal Article
Language: English
Published: Elsevier Ltd, 01.10.2024

Summary: Recent works have demonstrated the efficacy of Chain-of-Thought (CoT) reasoning that incorporates multimodal information in multiple complex reasoning tasks. CoT, which involves multiple stages of reasoning, has also been applied to Visual Question Answering (VQA) for scientific questions. Existing research on CoT in science-oriented VQA concentrates primarily on the extraction and integration of visual and textual information. However, these methods overlook the fact that image-question pairs, categorized by different attributes (such as subject, topic, category, skill, grade, and difficulty), emphasize distinct textual information, visual information, and reasoning capabilities. This work therefore proposes a novel VQA method, termed PGCL, founded on a prompt guidance strategy and self-supervised contrastive learning. PGCL mines and integrates textual and visual information based on attribute information. Specifically, two prompt templates are first crafted. They are then combined with the attribute information and the interference information of image-question pairs to generate a series of prompt positive and prompt negative samples, respectively. The constructed prompts guide the mining of visual and textual representations, which are integrated and enhanced via a transformer architecture and self-supervised contrastive learning. The fused features are finally used to predict answers for VQA. Extensive experiments substantiate both the individual contributions of the components within PGCL and its overall performance.

Highlights:
• We propose a novel visual question answering model for science curricula.
• The method guides information mining according to the constructed prompts.
• The method enhances information fusion via contrastive learning.
• Extensive experiments demonstrate the superiority of our approach.
ISSN: 0957-4174, 1873-6793
DOI: 10.1016/j.eswa.2024.124011
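The summary outlines a pipeline: attribute-based prompt templates yield prompt positive samples (from a pair's true attributes) and prompt negative samples (from interference information), and a self-supervised contrastive objective enhances the fused representations. As a rough illustration only, the PyTorch sketch below shows one conventional way such pieces are often realized. The template text, the function names (make_prompts, info_nce), and the InfoNCE loss form with temperature 0.07 are assumptions made for illustration, not details taken from the paper.

# Illustrative sketch (not the paper's code): builds prompt positive/negative
# text from attribute vs. interference information and applies an
# InfoNCE-style self-supervised contrastive loss to prompt-guided features.
import torch
import torch.nn.functional as F

# Hypothetical template; the paper's actual prompt templates are not given here.
TEMPLATE = ("A {subject} question on {topic} ({category}), requiring {skill}, "
            "{grade}, difficulty {difficulty}.")

def make_prompts(attributes: dict, interference: dict) -> tuple[str, str]:
    """Positive prompt from the pair's true attributes; negative prompt
    from mismatched (interference) attributes."""
    return TEMPLATE.format(**attributes), TEMPLATE.format(**interference)

def info_nce(anchor: torch.Tensor,
             positive: torch.Tensor,
             negatives: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE: pull each anchor toward its positive embedding and push it
    away from the negative embeddings. anchor, positive: (B, D); negatives: (K, D)."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    pos = (anchor * positive).sum(dim=-1, keepdim=True) / temperature  # (B, 1)
    neg = anchor @ negatives.T / temperature                           # (B, K)
    logits = torch.cat([pos, neg], dim=1)
    # The positive similarity occupies column 0 of the logits.
    targets = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, targets)

if __name__ == "__main__":
    attrs = {"subject": "natural science", "topic": "plants", "category": "biology",
             "skill": "classification", "grade": "grade 4", "difficulty": "easy"}
    noise = {"subject": "social science", "topic": "economics", "category": "civics",
             "skill": "recall", "grade": "grade 8", "difficulty": "hard"}
    pos_prompt, neg_prompt = make_prompts(attrs, noise)
    print(pos_prompt)
    print(neg_prompt)
    # Stand-in embeddings; in PGCL these would come from the prompt-guided
    # visual/text encoders and the transformer fusion module.
    loss = info_nce(torch.randn(8, 256), torch.randn(8, 256), torch.randn(16, 256))
    print(float(loss))

The temperature of 0.07 is a common default in contrastive learning; the paper's actual objective, sampling scheme, and hyperparameters may differ.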