Video Saliency Forecasting Transformer

Bibliographic Details
Published in: IEEE Transactions on Circuits and Systems for Video Technology, Vol. 32, No. 10, pp. 6850-6862
Main Authors: Ma, Cheng; Sun, Haowen; Rao, Yongming; Zhou, Jie; Lu, Jiwen
Format: Journal Article
Language: English
Published: New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.10.2022

Summary: Video saliency prediction (VSP) aims to imitate human eye fixations. However, the potential of this task has not been fully exploited, since existing VSP methods only focus on modeling the visual saliency of the input previous frames. In this paper, we present the first attempt to extend this task to video saliency forecasting (VSF) by forecasting the attention regions of consecutive future frames. To tackle this problem, we propose a video saliency forecasting transformer (VSFT) network built on a new encoder-decoder architecture. Different from existing VSP methods, our VSFT is the first pure-transformer-based architecture in the VSP field and is freed from dependence on the pretrained S3D model. In VSFT, the attention mechanism is exploited to capture spatial-temporal dependencies between the input past frames and the target future frame. We propose cross-attention guidance blocks (CAGB) to aggregate multi-level representation features and provide sufficient guidance for forecasting. We conduct comprehensive experiments on two benchmark datasets, DHF1K and Hollywood-2, and investigate the saliency forecasting and prediction abilities of existing VSP methods by modifying the supervision signals. Experimental results demonstrate that our method achieves superior performance on both the VSF and VSP tasks.
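
The core mechanism described in the summary, cross-attention from tokens of the target future frame to multi-level features of the past frames, can be illustrated with a minimal sketch. The module name, dimensions, and token layout below are assumptions chosen for illustration (using PyTorch's standard multi-head attention); this is not the authors' VSFT or CAGB implementation.

import torch
import torch.nn as nn

class CrossAttentionGuidanceSketch(nn.Module):
    """Illustrative cross-attention step: queries for the frame being
    forecast attend to encoder tokens of past frames. Hypothetical layout."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, future_queries, past_features):
        # future_queries: (B, N_q, dim)  tokens for the target future frame
        # past_features:  (B, N_kv, dim) aggregated tokens from past frames
        attended, _ = self.attn(query=future_queries,
                                key=past_features,
                                value=past_features)
        # Residual connection and normalization, as is typical in transformers.
        return self.norm(future_queries + attended)

if __name__ == "__main__":
    # Toy shapes only; real token counts depend on frame resolution and patching.
    block = CrossAttentionGuidanceSketch(dim=256, heads=8)
    q = torch.randn(2, 196, 256)        # queries for one future frame
    kv = torch.randn(2, 4 * 196, 256)   # tokens from four past frames
    print(block(q, kv).shape)           # torch.Size([2, 196, 256])
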
ISSN: 1051-8215, 1558-2205
DOI: 10.1109/TCSVT.2022.3172971