Improving visual question answering by combining scene-text information

Bibliographic Details
Published in: Multimedia Tools and Applications, Vol. 81, No. 9, pp. 12177-12208
Main Authors: Sharma, Himanshu; Jalal, Anand Singh
Format: Journal Article
Language: English
Published: New York: Springer US (Springer Nature B.V.), 01.04.2022
Summary: The text present in natural scenes contains semantic information about its surrounding environment. For example, the majority of questions asked by blind people about images around them require understanding of the text in those images. However, most existing Visual Question Answering (VQA) models do not consider the text present in an image. In this paper, the proposed model fuses multiple inputs: visual features, question features and OCR tokens. It also captures the relationship between OCR tokens and the objects in an image, which previous models fail to exploit. In contrast to previous models on the TextVQA dataset, the proposed model uses a dynamic pointer network-based decoder to predict multi-word answers (OCR tokens and words from a fixed vocabulary) instead of treating answer prediction as a single-step classification task. OCR tokens are represented using location, appearance, PHOC and Fisher Vector features in addition to the FastText features used by previous models on TextVQA. A powerful descriptor is constructed by applying Fisher Vectors (FV) computed from the PHOCs of the text present in images. This FV-based feature representation is better than representations based on word embeddings alone, which are used by previous state-of-the-art models. Quantitative and qualitative experiments on popular benchmarks including TextVQA, ST-VQA and VQA 2.0 demonstrate the efficacy of the proposed model, which attains 41.23% accuracy on the TextVQA dataset, 40.98% on the ST-VQA dataset and 74.98% overall accuracy on the VQA 2.0 dataset. The results show a significant gap between human and model accuracy on TextVQA and ST-VQA compared to VQA 2.0, recommending TextVQA and ST-VQA for future research to complement VQA 2.0.
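As a rough illustration of the OCR-token representation described in the summary (not the authors' code), the sketch below concatenates FastText, PHOC-based Fisher Vector, appearance and bounding-box location features for each detected OCR token and projects them into a common embedding space. All dimensions, class names and variable names are illustrative assumptions rather than values taken from the paper.

# Minimal sketch, assuming PyTorch; feature sizes are placeholders, not the paper's values.
import torch
import torch.nn as nn

D_FASTTEXT = 300      # FastText word embedding size (standard)
D_FISHER   = 604      # assumed size of the Fisher Vector built from PHOCs
D_APPEAR   = 2048     # assumed appearance feature size from an object detector
D_LOC      = 4        # normalized box: [x_min, y_min, x_max, y_max]
D_MODEL    = 768      # assumed common embedding size after projection

class OCRTokenEmbedding(nn.Module):
    """Fuse per-token features into one vector (illustrative only)."""
    def __init__(self):
        super().__init__()
        d_in = D_FASTTEXT + D_FISHER + D_APPEAR + D_LOC
        self.proj = nn.Sequential(
            nn.Linear(d_in, D_MODEL),
            nn.LayerNorm(D_MODEL),
        )

    def forward(self, fasttext, fisher, appearance, location):
        # Each input has shape (num_ocr_tokens, feature_dim); concatenate and project.
        fused = torch.cat([fasttext, fisher, appearance, location], dim=-1)
        return self.proj(fused)

# Usage with random stand-in features for 10 detected OCR tokens
embed = OCRTokenEmbedding()
tokens = embed(torch.randn(10, D_FASTTEXT),
               torch.randn(10, D_FISHER),
               torch.randn(10, D_APPEAR),
               torch.rand(10, D_LOC))
print(tokens.shape)  # torch.Size([10, 768])

In the model described in the summary, fused token embeddings of this kind would then feed the multimodal fusion stage and the dynamic pointer network decoder, which at each step either copies an OCR token or selects a word from the fixed vocabulary.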
ISSN: 1380-7501, 1573-7721
DOI: 10.1007/s11042-022-12317-0