High-Fidelity Pluralistic Image Completion with PLSA-VQGAN

Bibliographic Details
Published in: 2022 International Conference on High Performance Big Data and Intelligent Systems (HDIS), pp. 244 - 248
Main Authors: Wang, Tingran; Li, Ce; Qiao, Jingyi; Wei, Xianlong; Tang, Zhengyan; Tian, Yihan
Format: Conference Proceeding
Language: English
Published: IEEE, 10.12.2022

Summary: The process of image completion is essentially conditional image generation, and as such it admits diverse results, as long as reasonably coherent semantic information and realistic textures are achieved. Pluralistic Image Completion (PIC) pioneered the generation of diverse results for the image completion task, but it suffers from the inductive bias of Convolutional Neural Networks (CNNs) and performs poorly at understanding global structure, resulting in only slight semantic variation across completion results. More recently, hybrid architectures that combine the transformer's ability to capture long-range dependencies with CNNs' powerful texture modeling capabilities have performed well. However, these methods, such as High-Fidelity Pluralistic Image Completion with Transformers (ICT), use an autoencoder to acquire token representations of images and still suffer from texture distortion and poor articulation quality in filled regions due to image reconstruction losses. In this paper, focusing on this problem, we propose an image inpainting network based on VQGAN and a transformer to obtain high-quality, diverse results. In the first stage, the vector-quantization tokenization process, we introduce path-wise local spatial attention (PLSA) to reduce fine-grained reconstruction loss; in the second stage, a transformer network generates the missing parts. Experiments on the FFHQ and Places2 datasets show that our approach performs well in terms of image fidelity.
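The two-stage design hinges on the vector-quantization tokenization step: patch embeddings are snapped to their nearest codebook vectors, and the resulting discrete token grid is what the second-stage transformer completes. The following is a minimal illustrative sketch of that generic VQGAN-style quantization step, not the paper's implementation; the codebook size, embedding dimension, and function name are toy values chosen for the example.

```python
import numpy as np

def vq_tokenize(patch_embeddings: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each patch embedding to the index of its nearest codebook vector.

    patch_embeddings: (num_patches, dim)
    codebook:         (codebook_size, dim)
    Returns an integer array of shape (num_patches,) of discrete token ids.
    """
    # Squared Euclidean distance between every patch and every codebook entry.
    dists = ((patch_embeddings[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 4))  # toy codebook: 16 codes of dimension 4
# Patches constructed near codes 3, 7, and 3, with small perturbations.
patches = codebook[[3, 7, 3]] + 0.01 * rng.normal(size=(3, 4))
tokens = vq_tokenize(patches, codebook)
print(tokens)  # each patch snaps to its nearest code: [3 7 3]
```

In a full pipeline the inverse lookup `codebook[tokens]` feeds the decoder; the second-stage transformer then models the distribution over token ids for masked grid positions, which is what makes sampling multiple plausible completions possible.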
DOI: 10.1109/HDIS56859.2022.9991576