360° Image Saliency Prediction by Embedding Self-Supervised Proxy Task

Bibliographic Details
Published in: IEEE Transactions on Broadcasting, Vol. 69, no. 3, pp. 1-11
Main Authors: Zou, Zizhuang; Ye, Mao; Li, Shuai; Li, Xue; Dufaux, Frederic
Format: Journal Article
Language: English
Published: New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.09.2023

Summary: The development of the Metaverse industry produces many 360° images and videos. Transmitting these images and videos efficiently is key to the success of the Metaverse. Since a subject's field of view in the Metaverse is limited, bit rates can be saved, from a perception perspective, by focusing video encoding on salient regions. Regarding how 360° image projections are handled, existing works either combine local and global projections or use only the global projection for saliency prediction, which results in slow detection speed or low accuracy, respectively. In this work, we address this problem by Embedding a self-supervised Proxy task in the Saliency prediction Network, dubbed EPSNet. The main architecture follows an autoencoder, with an encoder for feature extraction and a decoder for saliency prediction. The proxy task is combined with the encoder to force it to learn both local and global information; the task is to find the location of a given local projection within the global projection via self-supervised learning. A cross-attention fusion mechanism fuses the global and local features for this location prediction. The decoder is then trained on the global projection alone, so the time-consuming local-global feature fusion is confined to the training stage. Experiments on a public dataset show that our method achieves satisfactory results in terms of inference speed and accuracy. The dataset and code are available at https://github.com/zzz0326/EPSNet.
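To illustrate the cross-attention fusion and location-prediction proxy task described in the summary, the following is a minimal, hypothetical PyTorch sketch. The module name CrossAttentionFusion, the feature dimensions, and the MSE location loss are illustrative assumptions, not the authors' actual implementation (the linked repository contains that).

    # Hypothetical sketch: local-projection features query the
    # global-projection features via cross-attention, and a small head
    # predicts where the local view sits in the global (equirectangular)
    # image. All names and shapes below are assumptions.
    import torch
    import torch.nn as nn

    class CrossAttentionFusion(nn.Module):
        def __init__(self, dim=256, num_heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)
            # Regress a normalized (x, y) location of the local view
            # inside the global projection.
            self.loc_head = nn.Sequential(
                nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 2), nn.Sigmoid()
            )

        def forward(self, local_tokens, global_tokens):
            # local_tokens:  (B, N_l, dim) features of one local projection
            # global_tokens: (B, N_g, dim) features of the global projection
            fused, _ = self.attn(query=local_tokens,
                                 key=global_tokens,
                                 value=global_tokens)
            fused = self.norm(fused + local_tokens)   # residual + norm
            return self.loc_head(fused.mean(dim=1))   # (B, 2) location

    if __name__ == "__main__":
        # Self-supervised proxy loss: the crop location of the local
        # projection is known for free, so it serves as the target.
        fusion = CrossAttentionFusion()
        local_feats = torch.randn(4, 49, 256)
        global_feats = torch.randn(4, 196, 256)
        pred_xy = fusion(local_feats, global_feats)
        target_xy = torch.rand(4, 2)  # known normalized crop location
        loss = nn.functional.mse_loss(pred_xy, target_xy)
        loss.backward()

Because this auxiliary head is dropped at inference time, the saliency decoder runs on the global projection alone, which is consistent with the speed claim in the summary.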
ISSN: 0018-9316, 1557-9611
DOI: 10.1109/TBC.2023.3254143