Image captioning for effective use of language models in knowledge-based visual question answering
Published in: Expert Systems with Applications, Vol. 212, p. 118669
Format: Journal Article
Language: English
Published: Elsevier Ltd, 01.02.2023
Summary: Integrating outside knowledge for reasoning in visio-linguistic tasks such as visual question answering (VQA) is an open problem. Given that pretrained language models have been shown to encode world knowledge, we propose a unimodal (text-only) training and inference procedure based on automatic off-the-shelf captioning of images and pretrained language models. More specifically, we verbalize the image contents so that language models can better leverage their implicit knowledge to solve knowledge-intensive tasks. Focusing on a visual question answering task that requires external knowledge (OK-VQA), our contributions are: (i) a text-only model that outperforms pretrained multimodal (image-text) models with a comparable number of parameters; (ii) confirmation that our text-only method is especially effective for tasks requiring external knowledge, as it is less effective on a standard VQA task (VQA 2.0); and (iii) state-of-the-art results when the size of the language model is increased. We also significantly outperform current multimodal systems, even those augmented with external knowledge. Our qualitative analysis on OK-VQA reveals that automatic captions often fail to capture relevant information in the images, which seems to be compensated by the better inference ability of the text-only language models. Our work opens up possibilities to further improve inference in visio-linguistic tasks.
Highlights:
• Captions are more effective than images for OK-VQA, a knowledge-intensive VQA task.
• Increasing the capacity of the language model makes it possible to reach state-of-the-art results.
• Our best model obtains results comparable to five GPT-3 runs, which are 15x larger.
• Our system is effective when external knowledge is needed.
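As a rough illustration of the caption-then-answer pipeline described in the summary, the sketch below verbalizes an image with an off-the-shelf captioner and then poses the question to a text-only language model over that caption. The specific checkpoints, prompt format, and helper function are illustrative assumptions, not the authors' exact setup.

```python
# Minimal sketch of a caption-then-answer pipeline, assuming off-the-shelf
# Hugging Face models. The checkpoints and prompt below are illustrative
# choices, not the ones used in the paper.
from transformers import pipeline

captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")
qa_model = pipeline("text2text-generation", model="google/flan-t5-large")

def answer_visual_question(image_path: str, question: str) -> str:
    # Step 1: verbalize the image contents with an automatic captioner.
    caption = captioner(image_path)[0]["generated_text"]
    # Step 2: let a text-only language model answer, using the caption
    # as a textual stand-in for the image.
    prompt = f"Context: {caption}\nQuestion: {question}\nAnswer:"
    return qa_model(prompt, max_new_tokens=20)[0]["generated_text"]

print(answer_visual_question("photo.jpg", "What country is this dish from?"))
```

Because the language model only ever sees text, any world knowledge needed to answer must come from its pretraining, which is the property the paper exploits for knowledge-intensive questions.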
ISSN: 0957-4174, 1873-6793
DOI: 10.1016/j.eswa.2022.118669