Wav2NeRF: Audio-driven realistic talking head generation via wavelet-based NeRF


Bibliographic Details
Published in: Image and Vision Computing, Vol. 148, p. 105104
Main Authors: Shin, Ah-Hyung; Lee, Jae-Ho; Hwang, Jiwon; Kim, Yoonhyung; Park, Gyeong-Moon
Format: Journal Article
Language: English
Published: Elsevier B.V., 01.08.2024

Summary: Talking head generation is an essential task in various real-world applications such as filmmaking and virtual reality. To this end, recent works focus on NeRF-based methods that can capture the 3D structural information of faces and generate more natural and vivid talking videos. However, existing NeRF-based methods fail to generate accurately audio-synced videos. In this paper, we point out that previous methods do not explicitly model audio-visual representations, which are crucial for precise lip synchronization. Moreover, existing methods struggle to generate high-frequency details, making the generated results unnatural. To overcome these problems, we propose a novel audio-synced and high-fidelity NeRF-based talking head generation framework, named Wav2NeRF, which learns audio-visual cross-modality representations and employs the wavelet transform for better visual quality. Specifically, we attach a 2D CNN-based neural rendering decoder to a NeRF-based encoder for fast generation of the whole image, which enables a new multi-level SyncNet loss for accurate lip synchronization. We also propose a novel cross-attention module to effectively fuse the image and audio representations. In addition, we integrate the wavelet transform into our framework by proposing a wavelet loss function that enhances high-frequency details. We demonstrate that the proposed method renders realistic and audio-synced talking head videos and, compared to the current NeRF-based state-of-the-art methods, shows outstanding average performance on four representative metrics: PSNR (+4.7%), SSIM (+2.2%), LMD (+51.3%), and SyncNet confidence (+154.7%).

Highlights:
• A novel cross-attention module for better lip synchronization.
• A multi-level SyncNet loss that mitigates the audio-visual trade-off.
• An analysis of the audio-visual trade-off and a wavelet-transform-based loss to address it.
• Superior results on widely used talking head generation datasets.
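The wavelet loss mentioned in the summary is not specified in this record; the sketch below is a minimal illustration (not the authors' implementation) of the general idea, assuming a single-level 2D Haar decomposition of rendered and ground-truth frames in PyTorch with an extra L1 penalty on the detail (high-frequency) sub-bands. The function names and the hf_weight parameter are hypothetical.

    # Illustrative sketch of a wavelet-based reconstruction loss (assumed design,
    # not the paper's exact formulation).
    import torch
    import torch.nn.functional as F

    def haar_dwt2(x: torch.Tensor):
        """Single-level 2D Haar DWT of an image batch (B, C, H, W) with even H, W.
        Returns four sub-bands of size (B, C, H/2, W/2): one low-pass (LL) and
        three detail sub-bands."""
        a = x[:, :, 0::2, 0::2]  # top-left pixel of each 2x2 block
        b = x[:, :, 0::2, 1::2]  # top-right
        c = x[:, :, 1::2, 0::2]  # bottom-left
        d = x[:, :, 1::2, 1::2]  # bottom-right
        ll = (a + b + c + d) / 2  # low-frequency approximation
        lh = (a + b - c - d) / 2  # detail sub-band
        hl = (a - b + c - d) / 2  # detail sub-band
        hh = (a - b - c + d) / 2  # diagonal detail sub-band
        return ll, lh, hl, hh

    def wavelet_loss(pred: torch.Tensor, target: torch.Tensor, hf_weight: float = 2.0):
        """L1 distance between corresponding wavelet sub-bands, with the detail
        sub-bands up-weighted to emphasize high-frequency details."""
        loss = 0.0
        for i, (p, t) in enumerate(zip(haar_dwt2(pred), haar_dwt2(target))):
            w = 1.0 if i == 0 else hf_weight  # i == 0 is the LL sub-band
            loss = loss + w * F.l1_loss(p, t)
        return loss

    # Usage: rendered and ground-truth frames in [0, 1], shape (B, 3, H, W)
    pred = torch.rand(2, 3, 256, 256)
    gt = torch.rand(2, 3, 256, 256)
    print(wavelet_loss(pred, gt).item())

Up-weighting the detail sub-bands is one plausible way a wavelet-domain loss can push the renderer toward sharper high-frequency content; the actual weighting and wavelet basis used by Wav2NeRF may differ.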
ISSN: 0262-8856, 1872-8138
DOI: 10.1016/j.imavis.2024.105104