A Neural Lip-Sync Framework for Synthesizing Photorealistic Virtual News Anchors

Lip sync has emerged as a promising technique for generating mouth movements from audio signals. However, synthesizing a high-resolution and photorealistic virtual news anchor is still challenging. Lack of natural appearance, visual consistency, and processing efficiency are the main problems with e...

Full description

Saved in:
Bibliographic Details
Published in2020 25th International Conference on Pattern Recognition (ICPR) pp. 5286 - 5293
Main Authors Zheng, Ruobing, Zhu, Zhou, Song, Bo, Ji, Changjiang
Format Conference Proceeding
LanguageEnglish
Published IEEE 10.01.2021
Subjects
Online AccessGet full text

Cover

Loading…
More Information
Summary:Lip sync has emerged as a promising technique for generating mouth movements from audio signals. However, synthesizing a high-resolution and photorealistic virtual news anchor is still challenging. Lack of natural appearance, visual consistency, and processing efficiency are the main problems with existing methods. In this paper, we present a novel lip-sync framework specially designed for producing high fidelity virtual news anchors. A pair of Temporal Convolutional Networks are used to learn the cross-modal sequential mapping from audio signals to mouth movements, followed by a neural rendering network that translates the synthetic facial map into high-resolution and photorealistic appearance. This fully-trainable framework provides an end-to-end processing that outperforms traditional graphics-based methods in many low-delay applications. Experiments also show the framework has advantages over modern neural-based methods in both visual appearance and efficiency.
DOI:10.1109/ICPR48806.2021.9412187