Video Saliency Forecasting Transformer

Bibliographic Details
Published in: IEEE Transactions on Circuits and Systems for Video Technology, Vol. 32, No. 10, pp. 6850-6862
Main Authors: Ma, Cheng; Sun, Haowen; Rao, Yongming; Zhou, Jie; Lu, Jiwen
Format: Journal Article
Language: English
Published: New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.10.2022

Summary: Video saliency prediction (VSP) aims to imitate human eye fixations. However, the potential of this task has not been fully exploited, since existing VSP methods only focus on modeling the visual saliency of the input previous frames. In this paper, we present the first attempt to extend this task to video saliency forecasting (VSF) by forecasting the attention regions of consecutive future frames. To tackle this problem, we propose a video saliency forecasting transformer (VSFT) network built on a new encoder-decoder architecture. Different from existing VSP methods, our VSFT is the first pure-transformer-based architecture in the VSP field and is freed from dependence on the pretrained S3D model. In VSFT, the attention mechanism is exploited to capture spatial-temporal dependencies between the input past frames and the target future frame. We propose cross-attention guidance blocks (CAGB) to aggregate multi-level representation features and provide sufficient guidance for forecasting. We conduct comprehensive experiments on two benchmark datasets, DHF1K and Hollywood-2, and investigate the saliency forecasting and prediction abilities of existing VSP methods by modifying the supervision signals. Experimental results demonstrate that our method achieves superior performance on both the VSF and VSP tasks.
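
The core mechanism described in the summary, cross-attention from tokens of the target future frame to multi-level features of the past frames, can be illustrated with a minimal sketch. The module name, dimensions, and token layout below are assumptions chosen for illustration (using PyTorch's standard multi-head attention); this is not the authors' VSFT or CAGB implementation.

import torch
import torch.nn as nn

class CrossAttentionGuidanceSketch(nn.Module):
    """Illustrative cross-attention step: queries for the frame being
    forecast attend to encoder tokens of past frames. Hypothetical layout."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, future_queries, past_features):
        # future_queries: (B, N_q, dim)  tokens for the target future frame
        # past_features:  (B, N_kv, dim) aggregated tokens from past frames
        attended, _ = self.attn(query=future_queries,
                                key=past_features,
                                value=past_features)
        # Residual connection and normalization, as is typical in transformers.
        return self.norm(future_queries + attended)

if __name__ == "__main__":
    # Toy shapes only; real token counts depend on frame resolution and patching.
    block = CrossAttentionGuidanceSketch(dim=256, heads=8)
    q = torch.randn(2, 196, 256)        # queries for one future frame
    kv = torch.randn(2, 4 * 196, 256)   # tokens from four past frames
    print(block(q, kv).shape)           # torch.Size([2, 196, 256])
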
ISSN: 1051-8215, 1558-2205
DOI: 10.1109/TCSVT.2022.3172971