A Neural Lip-Sync Framework for Synthesizing Photorealistic Virtual News Anchors

Bibliographic Details
Published in	2020 25th International Conference on Pattern Recognition (ICPR), pp. 5286-5293
Main Authors	Zheng, Ruobing; Zhu, Zhou; Song, Bo; Ji, Changjiang
Format	Conference Proceeding
Language	English
Published	IEEE 10.01.2021
Subjects	Convolution; Deep learning; lip sync; Lips; Mouth; Neural networks; neural rendering; Rendering (computer graphics); temporal convolutional networks; virtual anchor; Visualization
Summary:	Lip sync has emerged as a promising technique for generating mouth movements from audio signals. However, synthesizing a high-resolution and photorealistic virtual news anchor is still challenging. Lack of natural appearance, visual consistency, and processing efficiency are the main problems with existing methods. In this paper, we present a novel lip-sync framework specially designed for producing high-fidelity virtual news anchors. A pair of Temporal Convolutional Networks is used to learn the cross-modal sequential mapping from audio signals to mouth movements, followed by a neural rendering network that translates the synthetic facial map into a high-resolution, photorealistic appearance. This fully trainable framework provides end-to-end processing that outperforms traditional graphics-based methods in many low-delay applications. Experiments also show that the framework has advantages over modern neural-based methods in both visual appearance and efficiency.
DOI:	10.1109/ICPR48806.2021.9412187
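
Note on the term "Temporal Convolutional Network": the summary does not give architectural details, so the following is only a generic sketch of the causal dilated 1-D convolution that TCNs are commonly built from, not the authors' model. The function names, the kernel values, and the dilation schedule [1, 2, 4] are illustrative assumptions. It shows how stacking dilated layers widens the window of past audio frames that each output frame can depend on, without looking into the future.

```python
from typing import List


def causal_dilated_conv1d(x: List[float], kernel: List[float], dilation: int = 1) -> List[float]:
    """Compute y[t] = sum_k kernel[k] * x[t - k * dilation].

    Indices before the start of the sequence are treated as zero, so
    y[t] never depends on any x[s] with s > t (causality).
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out


def tcn_stack(x: List[float], kernel: List[float], dilations: List[int]) -> List[float]:
    """Apply one causal dilated convolution plus ReLU per dilation (hypothetical toy stack)."""
    h = list(x)
    for d in dilations:
        h = [max(0.0, v) for v in causal_dilated_conv1d(h, kernel, d)]
    return h


def receptive_field(kernel_size: int, dilations: List[int]) -> int:
    """Number of input frames that can influence a single output frame."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


if __name__ == "__main__":
    # A single impulse at frame 3 stands in for one audio feature over time.
    impulse = [0.0] * 12
    impulse[3] = 1.0
    print(tcn_stack(impulse, [1.0, 1.0], [1, 2, 4]))
    print(receptive_field(2, [1, 2, 4]))
```

With a kernel of size 2 and dilations 1, 2, 4, each output frame sees 8 input frames; doubling the dilation per layer grows this window exponentially with depth, which is one reason such networks are considered for low-delay sequence mapping.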