SBVQA 2.0: Robust End-to-End Speech-Based Visual Question Answering for Open-Ended Questions

Bibliographic Details
Published in: IEEE Access, Vol. 11, pp. 140967-140980
Main Authors: Alasmary, Faris; Al-Ahmadi, Saad
Format: Journal Article
Language: English
Published: Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2023
Summary: Speech-based Visual Question Answering (SBVQA) is a challenging task that aims to answer spoken questions about images. The challenges of this task involve the variability of speakers, the different recording environments, as well as the various objects in the image and their locations. This paper presents SBVQA 2.0, a robust multimodal neural network architecture that integrates information from both the visual and the speech domains. SBVQA 2.0 is composed of four modules: speech encoder, image encoder, features fusor, and answer generator. The speech encoder extracts semantic information from spoken questions, and the image encoder extracts visual information from images. The outputs of the two modules are combined using the features fusor and then processed by the answer generator to predict the answer. Although SBVQA 2.0 was trained on a single-speaker dataset with a clean background, we show that our selected speech encoder is more robust to noise and is speaker-independent. Moreover, we demonstrate that SBVQA 2.0 can be further improved by finetuning in an end-to-end manner since it uses fully differentiable modules. We open-source our pretrained models, source code, and dataset for the research community.
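To make the four-module pipeline described in the summary concrete, the following is a minimal PyTorch sketch of that structure (speech encoder, image encoder, features fusor, answer generator). All module choices, class names, dimensions, and the classification-style answer head are illustrative assumptions, not the authors' implementation; in the paper the speech and image encoders are pretrained networks rather than the placeholder layers shown here.

```python
# Minimal sketch of a speech/image VQA pipeline with four modules,
# mirroring the structure described in the summary. All dimensions,
# names, and layer choices are assumptions for illustration only.
import torch
import torch.nn as nn


class SBVQASketch(nn.Module):
    def __init__(self, speech_dim=768, image_dim=2048, fused_dim=512, num_answers=3000):
        super().__init__()
        # Speech encoder: placeholder for a pretrained, noise-robust,
        # speaker-independent encoder mapping a spoken question to a vector.
        self.speech_encoder = nn.Sequential(nn.Linear(speech_dim, fused_dim), nn.ReLU())
        # Image encoder: placeholder for a pretrained visual backbone
        # mapping an image to a feature vector.
        self.image_encoder = nn.Sequential(nn.Linear(image_dim, fused_dim), nn.ReLU())
        # Features fusor: combines the two modalities (here, concatenation
        # followed by a projection).
        self.fusor = nn.Sequential(nn.Linear(2 * fused_dim, fused_dim), nn.ReLU())
        # Answer generator: here a classifier over a fixed answer vocabulary,
        # a common simplification for open-ended VQA.
        self.answer_generator = nn.Linear(fused_dim, num_answers)

    def forward(self, speech_features, image_features):
        s = self.speech_encoder(speech_features)
        v = self.image_encoder(image_features)
        fused = self.fusor(torch.cat([s, v], dim=-1))
        return self.answer_generator(fused)


# Because every module is differentiable, the whole pipeline can be
# finetuned end-to-end, as the summary notes.
model = SBVQASketch()
logits = model(torch.randn(4, 768), torch.randn(4, 2048))
print(logits.shape)  # torch.Size([4, 3000])
```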
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2023.3339537