Talking Face Generation With Lip and Identity Priors

Bibliographic Details
Published in: Computer Animation and Virtual Worlds, Vol. 36, No. 3
Main Authors: Wu, Jiajie; Li, Frederick W. B.; Tam, Gary K. L.; Yang, Bailin; Nan, Fangzhe; Pan, Jiahao
Format: Journal Article
Language: English
Published: Hoboken, USA: John Wiley & Sons, Inc. / Wiley Subscription Services, Inc., 01.05.2025

Summary: Speech‐driven talking face video generation has attracted growing interest in recent research. While person‐specific approaches yield high‐fidelity results, they require extensive training data from each individual speaker. In contrast, general‐purpose methods often struggle with accurate lip synchronization, identity preservation, and natural facial movements. To address these limitations, we propose a novel architecture that combines an alignment model with a rendering model. The rendering model synthesizes identity‐consistent lip movements by leveraging facial landmarks derived from speech, a partially occluded target face, multi‐reference lip features, and the input audio. Concurrently, the alignment model estimates optical flow using the occluded face and a static reference image, enabling precise alignment of facial poses and lip shapes. This collaborative design enhances the rendering process, resulting in more realistic and identity‐preserving outputs. Extensive experiments demonstrate that our method significantly improves lip synchronization and identity retention, establishing a new benchmark in talking face video generation. We propose a speech‐driven talking face generation framework that integrates optical flow‐based alignment and audio‐aware rendering with multi‐reference lip features. Our method effectively improves lip detail and identity preservation.
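The summary describes a two-part design: an alignment model that estimates optical flow between the occluded target face and a static reference image to align pose and lip shape, and a rendering model that synthesizes the output frame from speech-derived landmarks, the occluded face, multi-reference lip features, and audio. The PyTorch sketch below illustrates one plausible way such a pipeline could be wired together; the layer choices, channel counts, and feature dimensions are illustrative assumptions, not the authors' published implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

def warp_with_flow(image, flow):
    """Warp an image with a dense (dx, dy) optical-flow field via grid sampling."""
    n, _, h, w = image.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h, device=image.device),
        torch.linspace(-1, 1, w, device=image.device),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(n, -1, -1, -1)
    offset = torch.stack((flow[:, 0] / (w / 2), flow[:, 1] / (h / 2)), dim=-1)
    return F.grid_sample(image, base + offset, align_corners=True)

class AlignmentModel(nn.Module):
    """Predicts optical flow from the occluded target face and a static reference,
    then warps the reference so its pose and lip shape match the target."""
    def __init__(self):
        super().__init__()
        self.flow_net = nn.Sequential(
            nn.Conv2d(6, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 2, 3, padding=1),   # 2-channel flow field (dx, dy)
        )
    def forward(self, occluded_face, reference_face):
        flow = self.flow_net(torch.cat([occluded_face, reference_face], dim=1))
        return warp_with_flow(reference_face, flow), flow

class RenderingModel(nn.Module):
    """Renders the output frame from a landmark sketch, the occluded face, the
    pose-aligned reference, pooled multi-reference lip features, and audio."""
    def __init__(self, audio_dim=256, lip_dim=256):
        super().__init__()
        self.image_encoder = nn.Conv2d(9, 64, 3, padding=1)  # three 3-channel image inputs
        self.cond_proj = nn.Linear(audio_dim + lip_dim, 64)
        self.decoder = nn.Sequential(
            nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1), nn.Sigmoid(),
        )
    def forward(self, landmark_map, occluded_face, aligned_reference,
                lip_features, audio_features):
        x = self.image_encoder(
            torch.cat([landmark_map, occluded_face, aligned_reference], dim=1))
        # Average the multi-reference lip features, fuse with the audio embedding,
        # and inject the result as a channel-wise bias.
        cond = torch.cat([lip_features.mean(dim=1), audio_features], dim=-1)
        return self.decoder(x + self.cond_proj(cond)[:, :, None, None])

# Toy forward pass with random tensors standing in for real inputs.
occluded = torch.rand(2, 3, 128, 128)     # lower-face-masked target frame
reference = torch.rand(2, 3, 128, 128)    # static identity reference image
landmarks = torch.rand(2, 3, 128, 128)    # speech-predicted landmark sketch
lip_feats = torch.rand(2, 5, 256)         # features from five reference lip crops
audio_feats = torch.rand(2, 256)          # per-frame audio embedding

aligned_ref, flow = AlignmentModel()(occluded, reference)
frame = RenderingModel()(landmarks, occluded, aligned_ref, lip_feats, audio_feats)
print(frame.shape)                        # torch.Size([2, 3, 128, 128])

In this sketch the warped reference is simply concatenated with the other image inputs; in the described system the alignment output presumably conditions rendering so that identity detail from the reference survives pose and lip-shape changes.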
Bibliography: Funding
This work was supported by the Zhejiang Provincial Natural Science Foundation of China (Grant No. LD24F020003), the National Natural Science Foundation of China (Grant No. 62172366), and the Major Sci‐Tech Innovation Project of Hangzhou City (2022AIZD0110).
ISSN: 1546-4261; 1546-427X
DOI: 10.1002/cav.70026