PGCL: Prompt guidance and self-supervised contrastive learning-based method for Visual Question Answering
Published in: Expert Systems with Applications, Vol. 251, p. 124011
Main Authors:
Format: Journal Article
Language: English
Published: Elsevier Ltd, 01.10.2024
Summary: Recent works have demonstrated the efficacy of Chain-of-Thought (CoT) reasoning that incorporates multimodal information across a range of complex reasoning tasks. CoT, which involves multiple stages of reasoning, has also been applied to Visual Question Answering (VQA) on scientific questions. Existing research on CoT in science-oriented VQA concentrates primarily on extracting and integrating visual and textual information. However, it overlooks the fact that image-question pairs with different attributes (such as subject, topic, category, skill, grade, and difficulty) emphasize distinct textual information, visual information, and reasoning capabilities. This work therefore proposes PGCL, a novel VQA method founded on a prompt guidance strategy and self-supervised contrastive learning. PGCL strategically mines and integrates textual and visual information based on attribute information. Specifically, two prompt templates are first crafted; these are combined with the attribute information and the interference information of image-question pairs to generate series of positive and negative prompt samples, respectively. The constructed prompts then guide the mining of visual and textual representations, which are integrated and enhanced via a transformer architecture and self-supervised contrastive learning. The fused features are finally used to predict answers for VQA. Extensive experiments substantiate the individual contributions of the components within PGCL, as well as the overall performance of PGCL.
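The abstract describes filling two prompt templates with a pair's true attribute values to obtain positive prompt samples, and with interference (corrupted) values to obtain negative ones. The following is a minimal sketch of that idea; the template strings, attribute vocabulary, and function names are illustrative assumptions, as the abstract does not give the paper's actual templates.

```python
import random

# Attribute schema taken from the abstract: subject, topic, category,
# skill, grade, and difficulty characterize each image-question pair.
ATTRIBUTES = ("subject", "topic", "category", "skill", "grade", "difficulty")

# Two hypothetical prompt templates (placeholders, not the paper's own).
TEMPLATE_TEXT = ("A {grade} {subject} question on {topic} ({category}), "
                 "testing {skill}, difficulty: {difficulty}.")
TEMPLATE_VISUAL = "An image for a {difficulty} {subject} problem about {topic}."

def positive_prompts(attrs: dict) -> list[str]:
    """Fill both templates with the pair's true attribute values."""
    return [TEMPLATE_TEXT.format(**attrs), TEMPLATE_VISUAL.format(**attrs)]

def negative_prompts(attrs: dict, vocab: dict) -> list[str]:
    """Build an interference prompt by swapping one attribute for a wrong value."""
    corrupted = dict(attrs)
    key = random.choice(ATTRIBUTES)
    wrong_values = [v for v in vocab[key] if v != attrs[key]]
    corrupted[key] = random.choice(wrong_values)
    return [TEMPLATE_TEXT.format(**corrupted), TEMPLATE_VISUAL.format(**corrupted)]
```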
• We propose a novel visual question answering model for science curricula.
• The method guides the mining of information according to the constructed prompts.
• The method also enhances the fusion of information via contrastive learning (a minimal sketch follows these highlights).
• Extensive experiments demonstrate the superiority of our approach.
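The highlights note that information fusion is enhanced through self-supervised contrastive learning. Below is a minimal sketch of an InfoNCE-style contrastive objective over prompt-guided features; the tensor shapes, function name, and temperature value are assumptions for illustration, since the abstract does not specify the exact loss used in PGCL.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(anchor, positive, negatives, temperature=0.07):
    """InfoNCE-style contrastive loss over prompt-guided features.

    anchor:    (B, D) fused image-question features
    positive:  (B, D) features guided by the matching (positive) prompt
    negatives: (B, K, D) features guided by K interference (negative) prompts
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    # Cosine similarity of each anchor to its positive: shape (B, 1)
    pos_sim = (anchor * positive).sum(dim=-1, keepdim=True)
    # Cosine similarity of each anchor to its K negatives: shape (B, K)
    neg_sim = torch.einsum("bd,bkd->bk", anchor, negatives)

    # The positive sits at column 0, so the target label is 0 for every row.
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature
    labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, labels)
```

Pulling the anchor toward the positive prompt's representation while pushing it away from interference prompts is one standard way to realize the "enhance fusion by contrastive learning" step described in the abstract.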
ISSN: 0957-4174, 1873-6793
DOI: 10.1016/j.eswa.2024.124011