Wav2NeRF: Audio-driven realistic talking head generation via wavelet-based NeRF
| Published in | Image and vision computing Vol. 148; p. 105104 |
|---|---|
| Main Authors | , , , , |
| Format | Journal Article |
| Language | English |
| Published | Elsevier B.V, 01.08.2024 |
| Subjects | |
| Online Access | Get full text |
Summary: Talking head generation is an essential task in various real-world applications such as film making and virtual reality. To this end, recent works focus on NeRF-based methods that can capture the 3D structural information of faces and generate more natural and vivid talking videos. However, existing NeRF-based methods fail to generate accurately audio-synced videos. In this paper, we point out that previous methods do not explicitly model audio-visual representations, which are crucial for precise lip synchronization. Moreover, existing methods struggle to generate high-frequency details, making the generated results look unnatural. To overcome these problems, we propose a novel audio-synced and high-fidelity NeRF-based talking head generation framework, named Wav2NeRF, which learns audio-visual cross-modality representations and employs the wavelet transform for better visual quality. Specifically, we attach a 2D CNN-based neural rendering decoder to a NeRF-based encoder so that the whole image can be generated quickly, which allows us to employ a new multi-level SyncNet loss for accurate lip synchronization. We also propose a novel cross-attention module to effectively fuse the image and audio representations. In addition, we integrate the wavelet transform into our framework by proposing a wavelet loss function to enhance high-frequency details. We demonstrate that the proposed method renders realistic and audio-synced talking head videos and, compared to current NeRF-based state-of-the-art methods, shows outstanding average performance on four representative metrics: PSNR (+4.7%), SSIM (+2.2%), LMD (+51.3%), and SyncNet Confidence (+154.7%).
Highlights:

- Present a novel cross-attention module for better lip synchronization.
- Present a multi-level SyncNet loss for mitigating the audio-visual trade-off.
- Analyze the audio-visual trade-off and propose a wavelet-transform-based loss to address it.
- Demonstrate superior results on widely used talking head generation datasets.
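The summary above describes a cross-attention module that fuses image and audio representations. The following is a minimal sketch, not the authors' implementation, of how such a fusion block could look in PyTorch: visual feature tokens act as queries and per-frame audio embeddings as keys and values. All class names, dimensions, and tensor shapes are illustrative assumptions.

```python
# Hypothetical sketch of audio-visual cross-attention fusion (not the paper's code).
import torch
import torch.nn as nn

class AudioVisualCrossAttention(nn.Module):
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        # Queries come from the visual (NeRF) features, keys/values from audio.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens, audio_tokens):
        # visual_tokens: (B, N_v, dim) feature tokens from a NeRF-based encoder
        # audio_tokens:  (B, N_a, dim) audio embeddings around the target frame
        fused, _ = self.attn(query=visual_tokens, key=audio_tokens, value=audio_tokens)
        # Residual connection keeps the visual content while injecting audio cues.
        return self.norm(visual_tokens + fused)

if __name__ == "__main__":
    block = AudioVisualCrossAttention()
    v = torch.randn(2, 1024, 256)   # e.g. a 32x32 feature map flattened into tokens
    a = torch.randn(2, 16, 256)     # e.g. 16 audio frames
    print(block(v, a).shape)        # torch.Size([2, 1024, 256])
```

In an architecture like the one summarized above, the fused tokens would then be passed to a 2D CNN rendering decoder to produce the full frame.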
ISSN: 0262-8856, 1872-8138
DOI: 10.1016/j.imavis.2024.105104
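The summary also mentions a wavelet loss for enhancing high-frequency details. The sketch below assumes a single-level 2D Haar decomposition and an L1 penalty on the high-frequency subbands; the paper's actual wavelet basis, decomposition depth, and subband weighting are not given here, so `haar_dwt2d`, `wavelet_loss`, and `hf_weight` are hypothetical names and choices.

```python
# Hypothetical wavelet-based loss: penalize differences in the high-frequency
# Haar subbands of predicted vs. ground-truth frames to encourage sharp detail.
import torch
import torch.nn.functional as F

def haar_dwt2d(x):
    """Single-level 2D Haar transform. x: (B, C, H, W) with even H and W."""
    a = x[..., 0::2, 0::2]
    b = x[..., 0::2, 1::2]
    c = x[..., 1::2, 0::2]
    d = x[..., 1::2, 1::2]
    ll = (a + b + c + d) / 2   # low-frequency approximation
    lh = (a + b - c - d) / 2   # horizontal detail
    hl = (a - b + c - d) / 2   # vertical detail
    hh = (a - b - c + d) / 2   # diagonal detail
    return ll, lh, hl, hh

def wavelet_loss(pred, target, hf_weight=1.0):
    """L1 loss on the high-frequency subbands (hf_weight is an assumed knob)."""
    _, lh_p, hl_p, hh_p = haar_dwt2d(pred)
    _, lh_t, hl_t, hh_t = haar_dwt2d(target)
    hf = F.l1_loss(lh_p, lh_t) + F.l1_loss(hl_p, hl_t) + F.l1_loss(hh_p, hh_t)
    return hf_weight * hf

if __name__ == "__main__":
    pred = torch.rand(2, 3, 256, 256)
    gt = torch.rand(2, 3, 256, 256)
    print(wavelet_loss(pred, gt).item())
```

Such a term would typically be added to the usual photometric reconstruction loss, so that low-frequency content is still matched while the high-frequency subbands receive extra supervision.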