Vman: visual-modified attention network for multimodal paradigms

Owing to its excellent dependency modeling and powerful parallel computing capabilities, the Transformer has become the primary research approach in vision-language tasks (VLT). However, for multimodal VLT such as visual question answering (VQA) and visual grounding (VG), which demand strong dependency modeling and comprehension of heterogeneous modalities, solving t...
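To make the abstract's setting concrete, below is a minimal sketch of the kind of cross-modal attention that Transformer-based VQA/VG models build on, where text tokens attend over visual features. This is a generic illustration only, not the Vman architecture described in the paper; the class name, dimensions, and residual/normalization layout are all illustrative assumptions.

```python
# A minimal cross-modal (text-to-image) attention sketch in PyTorch.
# NOTE: generic illustration of Transformer-style multimodal fusion;
# this is NOT the paper's Vman method. Names and sizes are assumptions.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        # Text tokens act as queries; visual regions act as keys/values.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text: torch.Tensor, vision: torch.Tensor) -> torch.Tensor:
        # text:   (batch, n_tokens, dim)  -- question/phrase embeddings
        # vision: (batch, n_regions, dim) -- image region/grid features
        fused, _ = self.attn(query=text, key=vision, value=vision)
        # Residual connection plus layer norm, as in standard Transformers.
        return self.norm(text + fused)

# Example: 14 question tokens attending over 49 image-grid features.
txt = torch.randn(2, 14, 512)
img = torch.randn(2, 49, 512)
out = CrossModalAttention()(txt, img)  # -> shape (2, 14, 512)
```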


Bibliographic Details
Published in: The Visual Computer, Vol. 41, No. 4, pp. 2737–2754
Main Authors: Song, Xiaoyu; Han, Dezhi; Chen, Chongqing; Shen, Xiang; Wu, Huafeng
Format: Journal Article
Language: English
Published: Berlin/Heidelberg: Springer Berlin Heidelberg (Springer Nature B.V.), 01.03.2025
