Mutual-Guidance Transformer-Embedding Network for Video Salient Object Detection


Bibliographic Details
Published in: IEEE Signal Processing Letters, Vol. 29, pp. 1674-1678
Main Authors: Min, Dingyao; Zhang, Chao; Lu, Yukang; Fu, Keren; Zhao, Qijun
Format: Journal Article
Language: English
Published: New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2022

Summary: Video salient object detection (VSOD) aims at locating the most attractive objects presented in video sequences by exploiting spatial and temporal cues. Previous methods mainly utilize convolutional neural networks (CNNs) to fuse or complement across RGB and optical flow cues via simple strategies. To take full advantage of CNNs and recently emerged Transformers, this letter proposes a novel mutual-guidance Transformer-embedding network, called MGT-Net, where a mutual-guidance multi-head attention mechanism (MGMA) explores more sophisticated long-range cross-modal interactions. Such a mechanism is designed into a new mutual-guidance Transformer (MGTrans) module that can propagate long-range contextual dependencies based on information of the other modality. To the best of our knowledge, MGT-Net is the first VSOD model that embeds Transformers as modules into CNNs for improved performance. Prior to MGTrans, we also propose and deploy a feature purification module (FPM) to purify noisy backbone features. Experimental results on five benchmark datasets demonstrate the state-of-the-art performance of MGT-Net.
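The summary describes mutual guidance as cross-modal attention: each modality (RGB or optical flow) propagates long-range context conditioned on the other. The record gives no implementation details, so the following is only a minimal numpy sketch of the general idea, assuming single-head scaled dot-product cross-attention where queries come from one modality and keys/values from the other; all function names, dimensions, and weight initializations here are illustrative, not the paper's actual MGMA design.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_feats, guide_feats, d_k=16, seed=0):
    """One attention head in which queries come from one modality and
    keys/values from the other, so the query modality aggregates
    long-range context under the guiding modality's cues.
    (Hypothetical sketch; weights are random for illustration.)"""
    rng = np.random.default_rng(seed)
    d_in = query_feats.shape[-1]
    w_q = rng.standard_normal((d_in, d_k)) / np.sqrt(d_in)
    w_k = rng.standard_normal((d_in, d_k)) / np.sqrt(d_in)
    w_v = rng.standard_normal((d_in, d_k)) / np.sqrt(d_in)
    q = query_feats @ w_q            # queries from modality A
    k = guide_feats @ w_k            # keys from modality B
    v = guide_feats @ w_v            # values from modality B
    attn = softmax(q @ k.T / np.sqrt(d_k))  # (tokens_A, tokens_B)
    return attn @ v

# "Mutual" guidance: RGB attends to flow, and flow attends to RGB.
rgb  = np.random.default_rng(1).standard_normal((64, 32))  # 64 spatial tokens
flow = np.random.default_rng(2).standard_normal((64, 32))
rgb_enhanced  = cross_attention(rgb, flow)
flow_enhanced = cross_attention(flow, rgb)
```

In a full model the two enhanced streams would typically be fused and decoded into a saliency map; here the sketch only shows the bidirectional query/key-value exchange that the abstract's "mutual-guidance" wording suggests.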
ISSN: 1070-9908, 1558-2361
DOI: 10.1109/LSP.2022.3192753