Talking Face Generation With Lip and Identity Priors

Bibliographic Details
Published in: Computer Animation and Virtual Worlds, Vol. 36, No. 3
Main Authors: Wu, Jiajie; Li, Frederick W. B.; Tam, Gary K. L.; Yang, Bailin; Nan, Fangzhe; Pan, Jiahao
Format: Journal Article
Language: English
Published: Hoboken, USA: John Wiley & Sons, Inc. / Wiley Subscription Services, Inc., 01.05.2025

Summary: Speech‐driven talking face video generation has attracted growing interest in recent research. While person‐specific approaches yield high‐fidelity results, they require extensive training data from each individual speaker. In contrast, general‐purpose methods often struggle with accurate lip synchronization, identity preservation, and natural facial movements. To address these limitations, we propose a novel architecture that combines an alignment model with a rendering model. The rendering model synthesizes identity‐consistent lip movements by leveraging facial landmarks derived from speech, a partially occluded target face, multi‐reference lip features, and the input audio. Concurrently, the alignment model estimates optical flow using the occluded face and a static reference image, enabling precise alignment of facial poses and lip shapes. This collaborative design enhances the rendering process, resulting in more realistic and identity‐preserving outputs. Extensive experiments demonstrate that our method significantly improves lip synchronization and identity retention, establishing a new benchmark in talking face video generation. We propose a speech‐driven talking face generation framework that integrates optical flow‐based alignment and audio‐aware rendering with multi‐reference lip features. Our method effectively improves lip detail and identity preservation.
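The summary describes a two-part design: an alignment model that estimates optical flow between the occluded target face and a static reference image to align pose and lip shape, and a rendering model that synthesizes the output frame from speech-derived landmarks, the occluded face, multi-reference lip features, and audio. The PyTorch sketch below illustrates one plausible way such a pipeline could be wired together; the layer choices, channel counts, and feature dimensions are illustrative assumptions, not the authors' published implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

def warp_with_flow(image, flow):
    """Warp an image with a dense (dx, dy) optical-flow field via grid sampling."""
    n, _, h, w = image.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h, device=image.device),
        torch.linspace(-1, 1, w, device=image.device),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(n, -1, -1, -1)
    offset = torch.stack((flow[:, 0] / (w / 2), flow[:, 1] / (h / 2)), dim=-1)
    return F.grid_sample(image, base + offset, align_corners=True)

class AlignmentModel(nn.Module):
    """Predicts optical flow from the occluded target face and a static reference,
    then warps the reference so its pose and lip shape match the target."""
    def __init__(self):
        super().__init__()
        self.flow_net = nn.Sequential(
            nn.Conv2d(6, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 2, 3, padding=1),   # 2-channel flow field (dx, dy)
        )
    def forward(self, occluded_face, reference_face):
        flow = self.flow_net(torch.cat([occluded_face, reference_face], dim=1))
        return warp_with_flow(reference_face, flow), flow

class RenderingModel(nn.Module):
    """Renders the output frame from a landmark sketch, the occluded face, the
    pose-aligned reference, pooled multi-reference lip features, and audio."""
    def __init__(self, audio_dim=256, lip_dim=256):
        super().__init__()
        self.image_encoder = nn.Conv2d(9, 64, 3, padding=1)  # three 3-channel image inputs
        self.cond_proj = nn.Linear(audio_dim + lip_dim, 64)
        self.decoder = nn.Sequential(
            nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1), nn.Sigmoid(),
        )
    def forward(self, landmark_map, occluded_face, aligned_reference,
                lip_features, audio_features):
        x = self.image_encoder(
            torch.cat([landmark_map, occluded_face, aligned_reference], dim=1))
        # Average the multi-reference lip features, fuse with the audio embedding,
        # and inject the result as a channel-wise bias.
        cond = torch.cat([lip_features.mean(dim=1), audio_features], dim=-1)
        return self.decoder(x + self.cond_proj(cond)[:, :, None, None])

# Toy forward pass with random tensors standing in for real inputs.
occluded = torch.rand(2, 3, 128, 128)     # lower-face-masked target frame
reference = torch.rand(2, 3, 128, 128)    # static identity reference image
landmarks = torch.rand(2, 3, 128, 128)    # speech-predicted landmark sketch
lip_feats = torch.rand(2, 5, 256)         # features from five reference lip crops
audio_feats = torch.rand(2, 256)          # per-frame audio embedding

aligned_ref, flow = AlignmentModel()(occluded, reference)
frame = RenderingModel()(landmarks, occluded, aligned_ref, lip_feats, audio_feats)
print(frame.shape)                        # torch.Size([2, 3, 128, 128])

In this sketch the warped reference is simply concatenated with the other image inputs; in the described system the alignment output presumably conditions rendering so that identity detail from the reference survives pose and lip-shape changes.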
Bibliography: Funding
This work was supported by the Zhejiang Provincial Natural Science Foundation of China (Grant No. LD24F020003), the National Natural Science Foundation of China (Grant No. 62172366), and the Major Sci‐Tech Innovation Project of Hangzhou City (2022AIZD0110).
ISSN: 1546-4261; 1546-427X
DOI: 10.1002/cav.70026