360° Image Saliency Prediction by Embedding Self-Supervised Proxy Task

Bibliographic Details
Published in: IEEE Transactions on Broadcasting, Vol. 69, no. 3, pp. 1-11
Main Authors: Zou, Zizhuang; Ye, Mao; Li, Shuai; Li, Xue; Dufaux, Frederic
Format: Journal Article
Language: English
Published: New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.09.2023

Summary: The development of the Metaverse industry produces many 360° images and videos. Transmitting these images and videos efficiently is key to the success of the Metaverse. Since a subject's field of view in the Metaverse is limited, bit rates can be saved, from a perception perspective, by focusing video encoding on salient regions. Regarding how 360° image projections are handled, existing works either combine local and global projections or use only the global projection for saliency prediction, which results in slow detection speed or low accuracy, respectively. In this work, we address this problem by Embedding a self-supervised Proxy task in the Saliency prediction Network, dubbed EPSNet. The main architecture follows an autoencoder, with an encoder for feature extraction and a decoder for saliency prediction. The proxy task is combined with the encoder to force it to learn both local and global information; the task is to find the location of a given local projection within the global projection via self-supervised learning. A cross-attention fusion mechanism fuses the global and local features for this location prediction. The decoder is then trained on the global projection alone, so the time-consuming local-global feature fusion is confined to the training stage. Experiments on a public dataset show that our method achieves satisfactory results in terms of inference speed and accuracy. The dataset and code are available at https://github.com/zzz0326/EPSNet.
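To illustrate the cross-attention fusion and location-prediction proxy task described in the summary, the following is a minimal, hypothetical PyTorch sketch. The module name CrossAttentionFusion, the feature dimensions, and the MSE location loss are illustrative assumptions, not the authors' actual implementation (the linked repository contains that).

    # Hypothetical sketch: local-projection features query the
    # global-projection features via cross-attention, and a small head
    # predicts where the local view sits in the global (equirectangular)
    # image. All names and shapes below are assumptions.
    import torch
    import torch.nn as nn

    class CrossAttentionFusion(nn.Module):
        def __init__(self, dim=256, num_heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)
            # Regress a normalized (x, y) location of the local view
            # inside the global projection.
            self.loc_head = nn.Sequential(
                nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 2), nn.Sigmoid()
            )

        def forward(self, local_tokens, global_tokens):
            # local_tokens:  (B, N_l, dim) features of one local projection
            # global_tokens: (B, N_g, dim) features of the global projection
            fused, _ = self.attn(query=local_tokens,
                                 key=global_tokens,
                                 value=global_tokens)
            fused = self.norm(fused + local_tokens)   # residual + norm
            return self.loc_head(fused.mean(dim=1))   # (B, 2) location

    if __name__ == "__main__":
        # Self-supervised proxy loss: the crop location of the local
        # projection is known for free, so it serves as the target.
        fusion = CrossAttentionFusion()
        local_feats = torch.randn(4, 49, 256)
        global_feats = torch.randn(4, 196, 256)
        pred_xy = fusion(local_feats, global_feats)
        target_xy = torch.rand(4, 2)  # known normalized crop location
        loss = nn.functional.mse_loss(pred_xy, target_xy)
        loss.backward()

Because this auxiliary head is dropped at inference time, the saliency decoder runs on the global projection alone, which is consistent with the speed claim in the summary.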
ISSN: 0018-9316, 1557-9611
DOI: 10.1109/TBC.2023.3254143