FRESA: Feedforward Reconstruction of Personalized Skinned Avatars from Few Images

Bibliographic Details
Main Authors: Wang, Rong; Prada, Fabian; Wang, Ziyan; Jiang, Zhongshi; Yin, Chengxiang; Li, Junxuan; Saito, Shunsuke; Santesteban, Igor; Romero, Javier; Joshi, Rohan; Li, Hongdong; Saragih, Jason; Sheikh, Yaser
Format: Journal Article
Language: English
Published: 24.03.2025
Summary: We present a novel method for reconstructing personalized 3D human avatars with realistic animation from only a few images. Due to the large variations in body shapes, poses, and cloth types, existing methods mostly require hours of per-subject optimization during inference, which limits their practical applications. In contrast, we learn a universal prior from over a thousand clothed humans to achieve instant feedforward generation and zero-shot generalization. Specifically, instead of rigging the avatar with shared skinning weights, we jointly infer personalized avatar shape, skinning weights, and pose-dependent deformations, which effectively improves overall geometric fidelity and reduces deformation artifacts. Moreover, to normalize pose variations and resolve the coupled ambiguity between canonical shapes and skinning weights, we design a 3D canonicalization process that produces pixel-aligned initial conditions, which helps to reconstruct fine-grained geometric details. We then propose a multi-frame feature aggregation scheme that robustly reduces artifacts introduced during canonicalization and fuses a plausible avatar that preserves person-specific identity. Finally, we train the model in an end-to-end framework on a large-scale capture dataset, which contains diverse human subjects paired with high-quality 3D scans. Extensive experiments show that our method generates more authentic reconstructions and animations than state-of-the-art methods, and generalizes directly to inputs from casually taken phone photos. Project page and code are available at https://github.com/rongakowang/FRESA.
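The core idea in the abstract, rigging the avatar with personalized rather than shared skinning weights, builds on linear blend skinning (LBS). As a rough orientation only, here is a minimal LBS sketch in NumPy: the function name, array shapes, and toy data are hypothetical, and FRESA's actual model additionally predicts pose-dependent deformations, which this sketch omits.

```python
# Minimal linear blend skinning (LBS) sketch, illustrating how a canonical
# avatar mesh can be posed with per-vertex skinning weights. This is NOT
# the FRESA implementation; all names and the toy data are hypothetical.
import numpy as np

def linear_blend_skinning(verts, weights, bone_transforms):
    """Pose canonical vertices with per-vertex skinning weights.

    verts:           (V, 3) canonical vertex positions
    weights:         (V, J) skinning weights, each row sums to 1
    bone_transforms: (J, 4, 4) rigid transform per joint
    returns:         (V, 3) posed vertex positions
    """
    V = verts.shape[0]
    verts_h = np.concatenate([verts, np.ones((V, 1))], axis=1)  # homogeneous, (V, 4)
    # Blend the joint transforms per vertex: (V, 4, 4)
    blended = np.einsum("vj,jab->vab", weights, bone_transforms)
    posed_h = np.einsum("vab,vb->va", blended, verts_h)
    return posed_h[:, :3]

# Toy example: two vertices, two joints (identity and a 90-degree z-rotation).
verts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
weights = np.array([[1.0, 0.0], [0.5, 0.5]])
rot_z = np.eye(4)
rot_z[:3, :3] = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
bones = np.stack([np.eye(4), rot_z])
print(linear_blend_skinning(verts, weights, bones))
```

The 3D canonicalization step the abstract describes can be thought of as the inverse direction of the same machinery: mapping observed, posed geometry back into a shared canonical pose, which is what makes jointly inferring canonical shape and skinning weights an ambiguous, coupled problem.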
DOI: 10.48550/arxiv.2503.19207