Mutual-Guidance Transformer-Embedding Network for Video Salient Object Detection
| Published in | IEEE Signal Processing Letters, Vol. 29, pp. 1674-1678 |
|---|---|
| Main Authors | , , , , |
| Format | Journal Article |
| Language | English |
| Published | New York: IEEE (The Institute of Electrical and Electronics Engineers, Inc.), 2022 |
| Summary | Video salient object detection (VSOD) aims at locating the most attractive objects presented in video sequences by exploiting spatial and temporal cues. Previous methods mainly utilize convolutional neural networks (CNNs) to fuse or complement across RGB and optical flow cues via simple strategies. To take full advantage of CNNs and recently emerged Transformers, this letter proposes a novel mutual-guidance Transformer-embedding network, called MGT-Net, where a mutual-guidance multi-head attention mechanism (MGMA) explores more sophisticated long-range cross-modal interactions. Such a mechanism is designed into a new mutual-guidance Transformer (MGTrans) module that can propagate long-range contextual dependencies based on information of the other modality. To the best of our knowledge, MGT-Net is the first VSOD model that embeds Transformers as modules into CNNs for improved performance. Prior to MGTrans, we also propose and deploy a feature purification module (FPM) to purify noisy backbone features. Experimental results on five benchmark datasets demonstrate the state-of-the-art performance of MGT-Net. |
| ISSN | 1070-9908; 1558-2361 |
| DOI | 10.1109/LSP.2022.3192753 |
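The summary describes a mutual-guidance attention mechanism in which each modality's long-range context is propagated under guidance from the other (RGB guided by optical flow, and vice versa). The paper's exact MGMA formulation is not reproduced in this record; a minimal NumPy sketch of the general idea, bidirectional cross-modal attention where queries come from one stream and keys/values from the other, might look like the following (all names, shapes, and the single-head form are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_feats, context_feats, d_k):
    # Queries taken from one modality, keys/values from the other,
    # so each token aggregates long-range context of the other stream.
    scores = query_feats @ context_feats.T / np.sqrt(d_k)  # (N, N)
    return softmax(scores) @ context_feats                 # (N, d_k)

# Toy features: 4 spatial tokens of dimension 8 per stream (hypothetical sizes).
rng = np.random.default_rng(0)
rgb = rng.standard_normal((4, 8))
flow = rng.standard_normal((4, 8))

rgb_guided = cross_attention(rgb, flow, d_k=8)   # flow guides RGB
flow_guided = cross_attention(flow, rgb, d_k=8)  # RGB guides flow
print(rgb_guided.shape, flow_guided.shape)       # (4, 8) (4, 8)
```

In a multi-head variant, as the MGMA name suggests, the same guidance would be computed over several learned projections in parallel; the sketch keeps a single head for clarity.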