Deep Residual Weight-Sharing Attention Network With Low-Rank Attention for Visual Question Answering

Bibliographic Details
Published in: IEEE Transactions on Multimedia, Vol. 25, pp. 4282-4295
Main Authors: Qin, Bosheng; Hu, Haoji; Zhuang, Yueting
Format: Journal Article
Language: English
Published: Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2023

Summary: Attention-based networks have recently become prevalent in visual question answering (VQA) due to their high performance. However, the extensive memory consumption of attention-based models places excessively high demands on deployment hardware, raising concerns about their future application scenarios. Designing an efficient, lightweight VQA model is therefore central to expanding the range of possible applications. Our work presents a novel lightweight attention-based VQA model, the residual weight-sharing attention network (RWSAN), consisting of residual weight-sharing attention (RWSA) layers cascaded in depth. Each RWSA layer models the textual representation with self residual weight-sharing attention (SRWSA) and captures question features and question-image interactions with self-guided residual weight-sharing attention (SGRWSA). Inside each RWSA layer, the proposed low-rank attention (LRA) units perform residual learning with learned connection patterns and shared parameters, and every stacked RWSA layer reuses the same parameters. Extensive ablation experiments with quantitative and qualitative analysis illustrate the effectiveness and generality of RWSA. Experiments on the VQA-v2, GQA, and CLEVR datasets show that RWSAN achieves competitive performance with far fewer parameters than state-of-the-art methods. We release our code at https://github.com/BrightQin/RWSAN .
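The two ideas the abstract combines, low-rank attention and weight sharing across stacked layers, can be illustrated with a minimal NumPy sketch. This is not the authors' LRA unit (their learned connection patterns and the SRWSA/SGRWSA split are omitted); it only shows, under assumed shapes, how factoring each d x d projection into rank-r factors and reusing one layer at every depth shrinks the parameter count.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class LowRankAttention:
    """Illustrative low-rank self-attention with a residual connection.

    Each d x d projection matrix W is factored as U @ V with
    U: (d, r) and V: (r, d), so the query/key/value projections
    cost 3 * 2 * d * r parameters instead of 3 * d * d.
    """
    def __init__(self, d, r, rng):
        self.d = d
        self.factors = {
            name: (rng.standard_normal((d, r)) / np.sqrt(d),
                   rng.standard_normal((r, d)) / np.sqrt(r))
            for name in ("q", "k", "v")
        }

    def project(self, x, name):
        U, V = self.factors[name]
        return x @ U @ V  # low-rank approximation of x @ W

    def __call__(self, x):
        q = self.project(x, "q")
        k = self.project(x, "k")
        v = self.project(x, "v")
        attn = softmax(q @ k.T / np.sqrt(self.d))
        return x + attn @ v  # residual learning around attention

rng = np.random.default_rng(0)
layer = LowRankAttention(d=64, r=8, rng=rng)

x = rng.standard_normal((10, 64))  # 10 tokens, 64-dim features
out = x
# weight sharing in depth: every "stacked" layer is the SAME object,
# so 4 applications add no parameters beyond the single layer
for _ in range(4):
    out = layer(out)
```

With d=64 and r=8, the three factored projections hold 3 * 2 * 64 * 8 = 3,072 parameters versus 3 * 64 * 64 = 12,288 for full-rank projections, and sharing the layer across all four depths keeps that count fixed regardless of network depth.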
ISSN: 1520-9210, 1941-0077
DOI: 10.1109/TMM.2022.3173131