PuzzleAvatar: Assembling 3D Avatars from Personal Albums

Bibliographic Details
Published in	ACM Transactions on Graphics, Vol. 43, No. 6, pp. 1-15
Main Authors	Xiu, Yuliang; Ye, Yufei; Liu, Zhen; Tzionas, Dimitris; Black, Michael J.
Format	Journal Article
Language	English
Published	New York, NY, USA: ACM, 19 December 2024
Subjects	Appearance and texture representations; Artificial intelligence; Computer vision; Computer vision problems; Computer vision representations; Computing methodologies; Reconstruction; Shape inference; text-to-image diffusion model; digital human; image-based modeling; text-guided 3D generation
Summary:	Generating personalized 3D avatars is crucial for AR/VR. However, recent text-to-3D methods that generate avatars for celebrities or fictional characters struggle with everyday people. Methods for faithful reconstruction typically require full-body images in controlled settings. What if users could just upload their personal "OOTD" (Outfit Of The Day) photo collection and get a faithful avatar in return? The challenge is that such casual photo collections contain diverse poses, challenging viewpoints, cropped views, and occlusion (albeit with a consistent outfit, accessories, and hairstyle). We address this novel "Album2Human" task by developing PuzzleAvatar, a novel model that generates a faithful 3D avatar (in a canonical pose) from a personal OOTD album, bypassing the challenging estimation of body and camera pose. To this end, we fine-tune a foundational vision-language model (VLM) on such photos, encoding the appearance, identity, garments, hairstyles, and accessories of a person into separate learned tokens and instilling these cues into the VLM. In effect, we exploit the learned tokens as "puzzle pieces" from which we assemble a faithful, personalized 3D avatar. Importantly, we can customize avatars by simply interchanging tokens. As a benchmark for this new task, we create a new dataset, called PuzzleIOI, with 41 subjects in a total of nearly 1k OOTD configurations, captured in challenging partial photos with paired ground-truth 3D bodies. Evaluation shows that PuzzleAvatar not only achieves high reconstruction accuracy, outperforming TeCH and MVDreamBooth, but also scales uniquely to album photos and demonstrates strong robustness. Our code and data are publicly available for research purposes at puzzleavatar.is.tue.mpg.de
ISSN:	0730-0301, 1557-7368
DOI:	10.1145/3687771