Improving visual question answering by combining scene-text information
Published in: Multimedia Tools and Applications, Vol. 81, No. 9, pp. 12177-12208
Format: Journal Article
Language: English
Published: New York: Springer US, 01.04.2022 (Springer Nature B.V.)
Summary: The text present in natural scenes contains semantic information about its surrounding environment. For example, the majority of questions asked by blind people about the images around them require understanding the text in the image. However, most existing Visual Question Answering (VQA) models do not consider the text present in an image. In this paper, the proposed model fuses multiple inputs such as visual features, question features and OCR tokens. It also captures the relationship between OCR tokens and the objects in an image, which previous models fail to use. Compared with previous models on the TextVQA dataset, the proposed model uses a dynamic pointer network based decoder to predict multi-word answers (OCR tokens and words from a fixed vocabulary) instead of treating answer prediction as a single-step classification task. OCR tokens are represented using location, appearance, PHOC and Fisher Vector features in addition to the FastText features used by previous models on TextVQA. A powerful descriptor is constructed by applying Fisher Vectors (FV) computed from the PHOCs of the text present in images. This FV-based feature representation is better than representations based on word embeddings alone, which are used by previous state-of-the-art models. Quantitative and qualitative experiments performed on popular benchmarks, including TextVQA, ST-VQA and VQA 2.0, reveal the efficacy of the proposed model, which attains 41.23% accuracy on the TextVQA dataset, 40.98% on the ST-VQA dataset and 74.98% overall accuracy on the VQA 2.0 dataset. The results suggest that the gap between human and model accuracy is significantly larger on TextVQA and ST-VQA than on VQA 2.0, recommending TextVQA and ST-VQA for future research as a complement to VQA 2.0.
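The multimodal OCR-token representation described in the summary can be illustrated with a short sketch. This is not the authors' code: the feature dimensions (300-d FastText, 604-d PHOC, an FV descriptor, a 4-d bounding box, a 2048-d appearance vector), the layer names and the simple concatenate-and-project fusion are all assumptions made for illustration.

```python
# Hypothetical sketch of assembling a multimodal OCR-token representation from
# pre-extracted features; all dimensions and names are illustrative assumptions.
import torch
import torch.nn as nn

class OCRTokenEncoder(nn.Module):
    def __init__(self, d_fasttext=300, d_phoc=604, d_fv=1024,
                 d_bbox=4, d_appearance=2048, d_model=768):
        super().__init__()
        d_in = d_fasttext + d_phoc + d_fv + d_bbox + d_appearance
        # Project the concatenated features into the common embedding space
        # shared with the question and visual features.
        self.project = nn.Linear(d_in, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, fasttext, phoc, fv, bbox, appearance):
        # Each input has shape (batch, num_ocr_tokens, feature_dim).
        fused = torch.cat([fasttext, phoc, fv, bbox, appearance], dim=-1)
        return self.norm(self.project(fused))

# Usage with random stand-in features: a batch of 2 images, 10 OCR tokens each.
enc = OCRTokenEncoder()
tokens = enc(torch.randn(2, 10, 300), torch.randn(2, 10, 604),
             torch.randn(2, 10, 1024), torch.randn(2, 10, 4),
             torch.randn(2, 10, 2048))
print(tokens.shape)  # torch.Size([2, 10, 768])
```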
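The dynamic pointer network decoding mentioned in the summary can likewise be sketched. The scoring functions and shapes below are illustrative assumptions; the point is only that each decoding step chooses between a fixed-vocabulary word and a copied OCR token, which is how multi-word answers can mix the two.

```python
# Minimal sketch of one dynamic-pointer decoding step, assuming a decoder hidden
# state and encoded OCR tokens are already available; sizes are illustrative.
import torch
import torch.nn as nn

d_model, vocab_size, num_ocr = 768, 5000, 10
vocab_proj = nn.Linear(d_model, vocab_size)       # scores over the fixed vocabulary

decoder_state = torch.randn(1, d_model)           # hidden state at this decoding step
ocr_features = torch.randn(1, num_ocr, d_model)   # encoded OCR tokens (see sketch above)

vocab_scores = vocab_proj(decoder_state)                                        # (1, vocab_size)
ocr_scores = torch.bmm(ocr_features, decoder_state.unsqueeze(-1)).squeeze(-1)   # (1, num_ocr)

# Argmax over the concatenated score vector: an index below vocab_size selects a
# vocabulary word, anything above it copies the corresponding OCR token.
choice = torch.cat([vocab_scores, ocr_scores], dim=-1).argmax(dim=-1).item()
print("copied OCR token" if choice >= vocab_size else "vocabulary word")
```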
ISSN: 1380-7501, 1573-7721
DOI: 10.1007/s11042-022-12317-0