VOODOO 3D: Volumetric Portrait Disentanglement for One-Shot 3D Head Reenactment

Bibliographic Details
Published in	2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10336-10348
Main Authors	Tran, Phong; Zakharov, Egor; Ho, Long-Nhat; Tran, Anh Tuan; Hu, Liwen; Li, Hao
Format	Conference Proceeding
Language	English
Published	IEEE 16.06.2024
Subjects	Computer vision; deepfakes; ethz; Face recognition; generativeai; Head; head-reenactment; holographicdisplay; mbzuai; metaverse; nerf; neural-avatar; pinscreen; Self-supervised learning; Solid modeling; Teleconferencing; Three-dimensional displays; triplane; vinai; voodoo3d

Summary:	We present a 3D-aware one-shot head reenactment method based on a fully volumetric neural disentanglement framework for source appearance and driver expressions. Our method is real-time and produces high-fidelity and view-consistent output, suitable for 3D teleconferencing systems based on holographic displays. Existing cutting-edge 3D-aware reenactment methods often use neural radiance fields or 3D meshes to produce view-consistent appearance encoding, but, at the same time, they rely on linear face models, such as 3DMM, to achieve its disentanglement with facial expressions. As a result, their reenactment results often exhibit identity leakage from the driver or have unnatural expressions. To address these problems, we propose a neural self-supervised disentanglement approach that lifts both the source image and driver video frame into a shared 3D volumetric representation based on tri-planes. This representation can then be freely manipulated with expression tri-planes extracted from the driving images and rendered from an arbitrary view using neural radiance fields. We achieve this disentanglement via self-supervised learning on a large in-the-wild video dataset. We further introduce a highly effective fine-tuning approach to improve the generalizability of the 3D lifting using the same real-world data. We demonstrate state-of-the-art performance on a wide range of datasets, and also showcase high-quality 3D-aware head reenactment on highly challenging and diverse subjects, including non-frontal head poses and complex expressions for both source and driver.
ISSN:	2575-7075
DOI:	10.1109/CVPR52733.2024.00984