Gaze prediction based on long short-term memory convolution with associated features of video frames

Bibliographic Details
Published in: Computers & Electrical Engineering, Vol. 107, p. 108625
Main Authors: Xiao, Limei; Zhu, Zizhong; Liu, Hao; Li, Ce; Fu, Wenhao
Format: Journal Article
Language: English
Published: Elsevier Ltd, 01.04.2023

Summary: Gaze prediction is a key issue in visual perception research. It can be used to infer important regions in videos, reducing the amount of computation in the learning and inference of various analysis tasks. Vanilla methods for dynamic video are unable to extract valid features, and the motion information among dynamic video frames is ignored, which leads to poor prediction results. We propose a gaze prediction method based on LSTM convolution with associated features of video frames (LSTM-CVFAF). First, by adding learnable central prior knowledge, the proposed method effectively and accurately extracts the spatial information of each frame. Second, an LSTM is deployed to obtain temporal motion gaze features. Finally, the spatial and temporal motion information is fused to generate the gaze prediction maps of the dynamic video. Compared with state-of-the-art models on the DHF1K dataset, CC, AUC-J, sAUC, and NSS are increased by 5.1%, 0.6%, 38.2%, and 0.5%, respectively.

Highlights:
•This paper proposes a gaze prediction algorithm based on LSTM convolution with video frame correlation features (LSTM-CVFAF).
•In SGP-Net, a convolutional Gaussian prior layer is used to model the center-bias phenomenon in eye fixation, and a TGP-Net composed of multiple ConvLSTM layers is proposed to learn temporal motion feature information between video frames.
•Fu-Net fuses the position and motion information extracted from SGP-Net and TGP-Net in a self-learning way, so as to obtain a more accurate video gaze prediction map.
ISSN: 0045-7906, 1879-0755
DOI: 10.1016/j.compeleceng.2023.108625
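
The abstract outlines a three-part pipeline: a spatial branch with a learnable Gaussian central prior (SGP-Net), a ConvLSTM-based temporal branch (TGP-Net), and a learned fusion stage (Fu-Net). Below is a minimal PyTorch sketch of that idea; the module names follow the highlights, but the stand-in backbone, layer sizes, and 1x1 fusion convolution are illustrative assumptions, not the authors' exact architecture.

# Minimal sketch of the LSTM-CVFAF idea described in the abstract:
# learnable Gaussian central prior (spatial), ConvLSTM (temporal),
# learned fusion. All layer sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GaussianPriorLayer(nn.Module):
    """Learnable center-bias prior: a bank of 2-D Gaussians whose means and
    variances are trained, appended to the feature maps as extra channels."""
    def __init__(self, n_gaussians: int = 8):
        super().__init__()
        self.mu = nn.Parameter(torch.rand(n_gaussians, 2))          # centers in [0, 1]
        self.log_sigma = nn.Parameter(torch.zeros(n_gaussians, 2))  # log std devs

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        b, _, h, w = feats.shape
        ys = torch.linspace(0, 1, h, device=feats.device)
        xs = torch.linspace(0, 1, w, device=feats.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack([gx, gy], dim=-1)                         # (h, w, 2)
        sigma = self.log_sigma.exp()
        # (n, h, w): one prior map per learned Gaussian center
        d = (grid.unsqueeze(0) - self.mu.view(-1, 1, 1, 2)) / sigma.view(-1, 1, 1, 2)
        priors = torch.exp(-0.5 * (d ** 2).sum(-1))
        return torch.cat([feats, priors.unsqueeze(0).expand(b, -1, -1, -1)], dim=1)


class ConvLSTMCell(nn.Module):
    """Standard ConvLSTM cell: LSTM gates computed with convolutions so the
    hidden state keeps its spatial layout across frames."""
    def __init__(self, in_ch: int, hid_ch: int, k: int = 3):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c


class GazePredictor(nn.Module):
    def __init__(self, feat_ch: int = 32, n_gaussians: int = 8, hid_ch: int = 32):
        super().__init__()
        self.encoder = nn.Conv2d(3, feat_ch, 3, padding=1)        # stand-in backbone
        self.prior = GaussianPriorLayer(n_gaussians)               # spatial branch (SGP-Net role)
        self.convlstm = ConvLSTMCell(feat_ch, hid_ch)              # temporal branch (TGP-Net role)
        self.fuse = nn.Conv2d(feat_ch + n_gaussians + hid_ch, 1, 1)  # Fu-Net role: learned fusion

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        """clip: (batch, time, 3, H, W) -> saliency maps (batch, time, 1, H, W)."""
        b, t, _, hgt, wid = clip.shape
        h = clip.new_zeros(b, self.convlstm.hid_ch, hgt, wid)
        c = torch.zeros_like(h)
        maps = []
        for step in range(t):
            feats = F.relu(self.encoder(clip[:, step]))
            spatial = self.prior(feats)          # per-frame features + learned central bias
            h, c = self.convlstm(feats, (h, c))  # motion information across frames
            maps.append(torch.sigmoid(self.fuse(torch.cat([spatial, h], dim=1))))
        return torch.stack(maps, dim=1)


if __name__ == "__main__":
    model = GazePredictor()
    sal = model(torch.randn(2, 5, 3, 64, 64))  # 2 clips, 5 frames each
    print(sal.shape)  # torch.Size([2, 5, 1, 64, 64])

In this sketch the central bias is carried as extra prior channels concatenated to the spatial features, and a 1x1 convolution stands in for the self-learned fusion; the paper's actual networks are deeper, and TGP-Net stacks multiple ConvLSTM layers.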