CASNet: A Cross-Attention Siamese Network for Video Salient Object Detection

Bibliographic Details
Published in: IEEE Transactions on Neural Networks and Learning Systems, Vol. 32, No. 6, pp. 2676-2690
Main Authors: Ji, Yuzhu; Zhang, Haijun; Jie, Zequn; Ma, Lin; Jonathan Wu, Q. M.
Format: Journal Article
Language: English
Published: Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.06.2021
Summary: Recent works on video salient object detection have demonstrated that directly transferring the generalization ability of image-based models to video data without modeling spatial-temporal information remains nontrivial and challenging. Considering both intraframe accuracy and interframe consistency of saliency detection, this article presents a novel cross-attention-based encoder-decoder model under the Siamese framework (CASNet) for video salient object detection. A baseline encoder-decoder model trained with the Lovász softmax loss function is adopted as a backbone network to guarantee the accuracy of intraframe salient object detection. Self- and cross-attention modules are incorporated into our model in order to preserve the saliency correlation and improve interframe salient object detection consistency. Extensive experimental results obtained by ablation analysis and cross-data set validation demonstrate the effectiveness of our proposed method. Quantitative results indicate that our CASNet model outperforms 19 state-of-the-art image- and video-based methods on six benchmark data sets.
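
The abstract describes the architecture only at a high level. Purely as an illustration of the general mechanism it names, and not the authors' implementation, the following minimal PyTorch sketch shows how a non-local-style cross-attention block might let encoder features of two frames in a Siamese pipeline exchange saliency evidence; all class names, tensor shapes, and the reduction factor are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossAttention2D(nn.Module):
    """Hypothetical non-local-style cross-attention: queries from one frame
    attend to keys/values of the other frame, so salient evidence can be
    exchanged across frames (not the CASNet authors' code)."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        inner = channels // reduction
        self.query = nn.Conv2d(channels, inner, kernel_size=1)
        self.key = nn.Conv2d(channels, inner, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learnable residual weight

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feat_a.shape
        q = self.query(feat_a).flatten(2).transpose(1, 2)   # (B, HW, C')
        k = self.key(feat_b).flatten(2)                      # (B, C', HW)
        v = self.value(feat_b).flatten(2).transpose(1, 2)    # (B, HW, C)
        attn = F.softmax(torch.bmm(q, k), dim=-1)            # (B, HW, HW)
        out = torch.bmm(attn, v).transpose(1, 2).reshape(b, c, h, w)
        return feat_a + self.gamma * out                      # residual fusion


if __name__ == "__main__":
    # Two frames of a clip would pass through a shared (Siamese) encoder;
    # here the encoder features are faked with random tensors of matching shape.
    fa, fb = torch.randn(2, 256, 32, 32), torch.randn(2, 256, 32, 32)
    cross = CrossAttention2D(channels=256)
    enhanced_a = cross(fa, fb)   # frame A enriched with evidence from frame B
    enhanced_b = cross(fb, fa)   # and vice versa
    print(enhanced_a.shape, enhanced_b.shape)

In this sketch the attended features would be fed to a decoder that predicts a saliency map per frame; the residual connection with a learnable weight is a common design choice for attention blocks, assumed here rather than taken from the paper.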
ISSN: 2162-237X
EISSN: 2162-2388
DOI: 10.1109/TNNLS.2020.3007534