SBVQA 2.0: Robust End-to-End Speech-Based Visual Question Answering for Open-Ended Questions

Bibliographic Details
Published in: IEEE Access, Vol. 11, pp. 140967-140980
Main Authors: Alasmary, Faris; Al-Ahmadi, Saad
Format: Journal Article
Language: English
Published: Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2023
Summary: Speech-based Visual Question Answering (SBVQA) is a challenging task that aims to answer spoken questions about images. The challenges of this task involve the variability of speakers, the different recording environments, as well as the various objects in the image and their locations. This paper presents SBVQA 2.0, a robust multimodal neural network architecture that integrates information from both the visual and the speech domains. SBVQA 2.0 is composed of four modules: speech encoder, image encoder, features fusor, and answer generator. The speech encoder extracts semantic information from spoken questions, and the image encoder extracts visual information from images. The outputs of the two modules are combined using the features fusor and then processed by the answer generator to predict the answer. Although SBVQA 2.0 was trained on a single-speaker dataset with a clean background, we show that our selected speech encoder is more robust to noise and is speaker-independent. Moreover, we demonstrate that SBVQA 2.0 can be further improved by finetuning in an end-to-end manner since it uses fully differentiable modules. We open-source our pretrained models, source code, and dataset for the research community.
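To make the four-module pipeline described in the summary concrete, the following is a minimal PyTorch sketch of that structure (speech encoder, image encoder, features fusor, answer generator). All module choices, class names, dimensions, and the classification-style answer head are illustrative assumptions, not the authors' implementation; in the paper the speech and image encoders are pretrained networks rather than the placeholder layers shown here.

```python
# Minimal sketch of a speech/image VQA pipeline with four modules,
# mirroring the structure described in the summary. All dimensions,
# names, and layer choices are assumptions for illustration only.
import torch
import torch.nn as nn


class SBVQASketch(nn.Module):
    def __init__(self, speech_dim=768, image_dim=2048, fused_dim=512, num_answers=3000):
        super().__init__()
        # Speech encoder: placeholder for a pretrained, noise-robust,
        # speaker-independent encoder mapping a spoken question to a vector.
        self.speech_encoder = nn.Sequential(nn.Linear(speech_dim, fused_dim), nn.ReLU())
        # Image encoder: placeholder for a pretrained visual backbone
        # mapping an image to a feature vector.
        self.image_encoder = nn.Sequential(nn.Linear(image_dim, fused_dim), nn.ReLU())
        # Features fusor: combines the two modalities (here, concatenation
        # followed by a projection).
        self.fusor = nn.Sequential(nn.Linear(2 * fused_dim, fused_dim), nn.ReLU())
        # Answer generator: here a classifier over a fixed answer vocabulary,
        # a common simplification for open-ended VQA.
        self.answer_generator = nn.Linear(fused_dim, num_answers)

    def forward(self, speech_features, image_features):
        s = self.speech_encoder(speech_features)
        v = self.image_encoder(image_features)
        fused = self.fusor(torch.cat([s, v], dim=-1))
        return self.answer_generator(fused)


# Because every module is differentiable, the whole pipeline can be
# finetuned end-to-end, as the summary notes.
model = SBVQASketch()
logits = model(torch.randn(4, 768), torch.randn(4, 2048))
print(logits.shape)  # torch.Size([4, 3000])
```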
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2023.3339537