DeepCT: A novel deep complex-valued network with learnable transform for video saliency prediction

Bibliographic Details
Published in Pattern Recognition Vol. 102; p. 107234
Main Authors Jiang, Lai; Xu, Mai; Zhang, Shanyi; Sigal, Leonid
Format Journal Article
Language English
Published Elsevier Ltd, 01.06.2020
Summary:
•We propose a new structure, DeepCT, for video saliency prediction, in which a complex-valued CNN and a Convolutional LSTM are integrated.
•We propose learning multi-scale spatio-temporal transforms through the developed complex-valued transform and inverse complex-valued transform modules.
•We formulate the learnable transforms through a cycle consistency loss, such that the transform and inverse transform can be paired by minimizing reconstruction errors in both the pixel and transformed domains.
•We evaluate the saliency prediction accuracy of our method over 2 databases and 4 metrics, together with statistical analysis. The experimental results show that our method outperforms 13 other state-of-the-art methods.

The past decade has witnessed the success of transformed-domain methods for image saliency prediction. However, it is intractable to develop a transformed-domain method for video saliency prediction, due to the limited choice of spatio-temporal transforms. In this paper, we propose learning the transform from training data, rather than relying on the predefined transforms of existing methods. Specifically, we develop a novel deep Complex-valued network with learnable Transform (DeepCT) for video saliency prediction. The architecture of DeepCT includes the Complex-valued Transform Module (CTM), the inverse CTM (iCTM) and the Complex-valued Stacked Convolutional Long Short-Term Memory network (CS-ConvLSTM). In the CTM and iCTM, multi-scale pyramid structures are introduced, as we find that transforms at multiple receptive scales can improve the accuracy of saliency prediction. To make the CTM and iCTM "invertible", we further propose a cycle consistency loss for training DeepCT, which is composed of a frame reconstruction loss and a complex feature reconstruction loss. Additionally, the CS-ConvLSTM is developed to learn the temporal saliency transition across video frames. Finally, the experimental results show that our DeepCT method outperforms 13 other state-of-the-art methods for video saliency prediction.
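To make the learnable complex-valued transform and its cycle consistency loss concrete, below is a minimal PyTorch sketch. It is not the authors' code: the names (ComplexConv2d, CTM, iCTM, cycle_consistency_loss), the single-scale (non-pyramid) structure, and the L1 reconstruction terms are all assumptions for illustration; the paper's modules are multi-scale pyramids, and its exact loss formulation may differ. Complex features are represented here as (real, imaginary) tensor pairs, following the common deep-complex-networks convention; the CS-ConvLSTM that consumes the transformed features is omitted.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ComplexConv2d(nn.Module):
    """Complex convolution (Wr + iWi) * (xr + ixi), stored as two real convs."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.conv_r = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)
        self.conv_i = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)

    def forward(self, xr, xi):
        yr = self.conv_r(xr) - self.conv_i(xi)   # real part
        yi = self.conv_r(xi) + self.conv_i(xr)   # imaginary part
        return yr, yi

class CTM(nn.Module):
    """Learnable 'transform': lifts a real frame into complex features."""
    def __init__(self, ch=32):
        super().__init__()
        self.conv = ComplexConv2d(3, ch)

    def forward(self, frame):
        # Treat the input frame as purely real (zero imaginary part).
        return self.conv(frame, torch.zeros_like(frame))

class iCTM(nn.Module):
    """Learnable inverse transform: maps complex features back to a frame."""
    def __init__(self, ch=32):
        super().__init__()
        self.conv = ComplexConv2d(ch, 3)

    def forward(self, fr, fi):
        yr, _ = self.conv(fr, fi)   # keep the real part as the reconstruction
        return yr

def cycle_consistency_loss(ctm, ictm, frame):
    # Pixel-domain term: frame -> CTM -> iCTM should reconstruct the frame.
    fr, fi = ctm(frame)
    recon = ictm(fr, fi)
    loss_pixel = F.l1_loss(recon, frame)
    # Transformed-domain term (one plausible reading): re-transforming the
    # reconstruction should reproduce the original complex features.
    gr, gi = ctm(recon)
    loss_feat = F.l1_loss(gr, fr) + F.l1_loss(gi, fi)
    return loss_pixel + loss_feat

# Example usage on a random frame batch:
ctm, ictm = CTM(), iCTM()
frame = torch.rand(2, 3, 64, 64)
loss = cycle_consistency_loss(ctm, ictm, frame)
loss.backward()

Pairing the two reconstruction terms is what makes the learned CTM/iCTM approximately invertible: the pixel term alone would let the transform discard information that the inverse hallucinates back, while the transformed-domain term ties the feature representation to the inverse mapping as well.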
ISSN:0031-3203
1873-5142
DOI:10.1016/j.patcog.2020.107234