Mutual-Guidance Transformer-Embedding Network for Video Salient Object Detection


Bibliographic Details
Published in: IEEE Signal Processing Letters, Vol. 29, pp. 1674-1678
Main Authors: Min, Dingyao; Zhang, Chao; Lu, Yukang; Fu, Keren; Zhao, Qijun
Format: Journal Article
Language: English
Published: New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2022

Summary: Video salient object detection (VSOD) aims at locating the most attractive objects presented in video sequences by exploiting spatial and temporal cues. Previous methods mainly utilize convolutional neural networks (CNNs) to fuse or complement across RGB and optical flow cues via simple strategies. To take full advantage of CNNs and recently emerged Transformers, this letter proposes a novel mutual-guidance Transformer-embedding network, called MGT-Net, where a mutual-guidance multi-head attention mechanism (MGMA) explores more sophisticated long-range cross-modal interactions. Such a mechanism is designed into a new mutual-guidance Transformer (MGTrans) module that can propagate long-range contextual dependencies based on information of the other modality. To the best of our knowledge, MGT-Net is the first VSOD model that embeds Transformers as modules into CNNs for improved performance. Prior to MGTrans, we also propose and deploy a feature purification module (FPM) to purify noisy backbone features. Experimental results on five benchmark datasets demonstrate the state-of-the-art performance of MGT-Net.
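The summary describes mutual guidance as cross-modal attention: each modality (RGB or optical flow) propagates long-range context conditioned on the other. The record gives no implementation details, so the following is only a minimal numpy sketch of the general idea, assuming single-head scaled dot-product cross-attention where queries come from one modality and keys/values from the other; all function names, dimensions, and weight initializations here are illustrative, not the paper's actual MGMA design.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_feats, guide_feats, d_k=16, seed=0):
    """One attention head in which queries come from one modality and
    keys/values from the other, so the query modality aggregates
    long-range context under the guiding modality's cues.
    (Hypothetical sketch; weights are random for illustration.)"""
    rng = np.random.default_rng(seed)
    d_in = query_feats.shape[-1]
    w_q = rng.standard_normal((d_in, d_k)) / np.sqrt(d_in)
    w_k = rng.standard_normal((d_in, d_k)) / np.sqrt(d_in)
    w_v = rng.standard_normal((d_in, d_k)) / np.sqrt(d_in)
    q = query_feats @ w_q            # queries from modality A
    k = guide_feats @ w_k            # keys from modality B
    v = guide_feats @ w_v            # values from modality B
    attn = softmax(q @ k.T / np.sqrt(d_k))  # (tokens_A, tokens_B)
    return attn @ v

# "Mutual" guidance: RGB attends to flow, and flow attends to RGB.
rgb  = np.random.default_rng(1).standard_normal((64, 32))  # 64 spatial tokens
flow = np.random.default_rng(2).standard_normal((64, 32))
rgb_enhanced  = cross_attention(rgb, flow)
flow_enhanced = cross_attention(flow, rgb)
```

In a full model the two enhanced streams would typically be fused and decoded into a saliency map; here the sketch only shows the bidirectional query/key-value exchange that the abstract's "mutual-guidance" wording suggests.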
ISSN: 1070-9908, 1558-2361
DOI: 10.1109/LSP.2022.3192753