Wav2Lip‐HR: Synthesising clear high‐resolution talking head in the wild
| Published in | Computer Animation and Virtual Worlds, Vol. 35, No. 1 |
|---|---|
| Main Authors | , , , |
| Format | Journal Article |
| Language | English |
| Published | Chichester: Wiley Subscription Services, Inc., 01.01.2024 |
Summary: | Talking head generation aims to synthesize a photo‐realistic speaking video with accurate lip motion. While this field has attracted increasing attention in recent audio‐visual research, most existing methods do not improve lip synchronization and visual quality simultaneously. In this paper, we propose Wav2Lip‐HR, a neural, audio‐driven method for high‐resolution talking head generation. With our technique, all that is required to generate a clear, high‐resolution lip‐synced talking video is an image or video of the target face and an audio clip of any speech. The primary benefit of our method is that it generates clear high‐resolution videos with sufficient facial detail, rather than videos that are merely large‐sized with little clarity. We first analyze the key factors that limit the clarity of generated videos and then put forth several solutions, including data augmentation, model structure improvement, and a more effective loss function. Finally, we employ several efficient metrics to evaluate the clarity of the images generated by our approach, as well as several widely used metrics to evaluate lip‐sync performance. Extensive experiments demonstrate that our method outperforms existing schemes in both visual quality and lip synchronization.
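The abstract mentions "a more effective loss function" as one lever for sharper output but does not specify its form. As a minimal sketch only, the snippet below illustrates one common way such a clarity objective is built: a standard L1 reconstruction term combined with an image-gradient term that penalizes blurred edges. The function names, the gradient term, and the weighting are illustrative assumptions, not the paper's actual loss.

```python
import numpy as np

def l1_loss(pred, target):
    # Mean absolute pixel error: the standard reconstruction term.
    return np.mean(np.abs(pred - target))

def gradient_loss(pred, target):
    # Hypothetical clarity term (not from the paper): compare horizontal
    # and vertical image gradients so blurred edges are penalized even
    # when average pixel values are close.
    dx = np.mean(np.abs(np.diff(pred, axis=1) - np.diff(target, axis=1)))
    dy = np.mean(np.abs(np.diff(pred, axis=0) - np.diff(target, axis=0)))
    return dx + dy

def combined_loss(pred, target, w_grad=0.5):
    # Weighted sum; w_grad is an assumed hyperparameter.
    return l1_loss(pred, target) + w_grad * gradient_loss(pred, target)
```

A perfect reconstruction scores zero, while a uniformly blurred prediction of the same mean brightness scores higher because its edges are gone, which is the behavior a sharpness-aware loss is meant to reward.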
Our proposed Wav2Lip‐HR produces clear, high‐resolution talking videos in real time. All that is required is a portrait and a clip of speech; the generated video is fully synchronized with the input audio.
Bibliography: | ObjectType-Article-1 SourceType-Scholarly Journals-1 ObjectType-Feature-2 content type line 14 |
ISSN: | 1546-4261 1546-427X |
DOI: | 10.1002/cav.2226 |