Video Saliency Prediction Using Spatiotemporal Residual Attentive Networks


Bibliographic Details
Published in: IEEE Transactions on Image Processing, Vol. 29, pp. 1113-1126
Main Authors: Lai, Qiuxia; Wang, Wenguan; Sun, Hanqiu; Shen, Jianbing
Format: Journal Article
Language: English
Published: United States: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.01.2020

Summary: This paper proposes a novel residual attentive learning network architecture for predicting dynamic eye-fixation maps. The proposed model addresses two essential issues: effective spatiotemporal feature integration and multi-scale saliency learning. For the first issue, appearance and motion streams are tightly coupled via dense residual cross connections, which integrate appearance information with multi-layer, comprehensive motion features in a residual and dense manner. Unlike traditional two-stream models that learn appearance and motion features separately, this design allows early, multi-path information exchange between the two domains, yielding a unified and powerful spatiotemporal learning architecture. For the second issue, we propose a composite attention mechanism that learns multi-scale local attentions and global attention priors end-to-end; it enhances the fused spatiotemporal features by emphasizing important features at multiple scales. A lightweight convolutional Gated Recurrent Unit (convGRU), which is well suited to small training sets, models long-term temporal characteristics. Extensive experiments over four benchmark datasets clearly demonstrate the advantage of the proposed video saliency model over its competitors and the effectiveness of each component of our network. Our code and all results will be available at https://github.com/ashleylqx/STRA-Net.
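
The summary mentions a lightweight convolutional GRU for long-term temporal modeling. The sketch below is a minimal, generic convGRU cell in PyTorch, not the authors' implementation (see the linked repository for that); the class name `ConvGRUCell`, channel counts, and the 3x3 kernel size are assumptions for illustration only.

```python
# Minimal sketch of a convolutional GRU cell, assuming standard GRU gating
# with 2D convolutions in place of fully connected layers.
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2
        self.hidden_channels = hidden_channels
        # Update and reset gates computed jointly from [input, previous hidden state].
        self.gates = nn.Conv2d(in_channels + hidden_channels,
                               2 * hidden_channels, kernel_size, padding=padding)
        # Candidate hidden state computed from [input, reset-gated hidden state].
        self.candidate = nn.Conv2d(in_channels + hidden_channels,
                                   hidden_channels, kernel_size, padding=padding)

    def forward(self, x, h_prev=None):
        # x: (B, C_in, H, W) fused spatiotemporal feature map for the current frame.
        if h_prev is None:
            h_prev = x.new_zeros(x.size(0), self.hidden_channels,
                                 x.size(2), x.size(3))
        combined = torch.cat([x, h_prev], dim=1)
        z, r = torch.chunk(torch.sigmoid(self.gates(combined)), 2, dim=1)
        h_tilde = torch.tanh(self.candidate(torch.cat([x, r * h_prev], dim=1)))
        # Standard GRU update: blend previous state and candidate per pixel.
        h = (1 - z) * h_prev + z * h_tilde
        return h
```

In use, the cell would be applied frame by frame, carrying the hidden state forward so that the saliency readout at each time step can depend on earlier frames; keeping the cell convolutional and single-layer is what makes it lightweight enough for the small-training-data regime the summary describes.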
ISSN: 1057-7149, 1941-0042
DOI: 10.1109/TIP.2019.2936112