SATFace: Subject Agnostic Talking Face Generation with Natural Head Movement

Bibliographic Details
Published in: Neural Processing Letters, Vol. 55, No. 6, pp. 7529-7542
Main Authors: Yang, Shuai; Qiao, Kai; Shi, Shuhao; Yang, Jie; Ma, Dekui; Hu, Guoen; Yan, Bin; Chen, Jian
Format: Journal Article
Language: English
Published: New York: Springer US (Springer Nature B.V.), 01.12.2023

Summary: Talking face generation is widely used in education, entertainment, shopping, and other social settings. Existing methods focus on matching the speaker's mouth shape to the speech content, but there is little research on automatically extracting latent head-motion features from speech, so the generated videos lack naturalness. This paper proposes SATFace, a subject-agnostic talking face generation method with natural head movement. To model the complicated and critical features of a talking face (identity, background, mouth shape, head posture, etc.), SATFace is built on an encoder-decoder as the primary network architecture. A long short-time feature learning network is then designed to better exploit the global and local information in the audio for generating plausible head movement. In addition, a modular training process is proposed to improve the effectiveness and efficiency of learning both explicit and implicit features. Experimental comparisons show that SATFace improves on mainstream methods by at least about 9.8% in cumulative probability of blur detection and 8.2% in synchronization confidence. Mean opinion scores show that SATFace has advantages in lip-sync quality, naturalness of head movement, and video realness.
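
To make the "long short-time" idea in the summary concrete, below is a minimal sketch (not the authors' code) of an audio encoder with a short-time branch for local, frame-level cues and a long-time branch for a global utterance summary, fused to predict a per-frame head-pose sequence. The module names, layer sizes, pose parameterization, and fusion strategy are all assumptions for illustration only.

```python
# Hypothetical sketch of a long/short-time audio feature module for head-pose
# prediction, loosely following the abstract's description. Not the SATFace code.
import torch
import torch.nn as nn

class LongShortTimeAudioEncoder(nn.Module):
    def __init__(self, n_mels=80, hidden=256, pose_dim=6):
        super().__init__()
        # Short-time branch: small-kernel 1-D convolutions over mel frames
        # capture local (phoneme-level) cues.
        self.short_branch = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Long-time branch: a recurrent layer summarizes the whole clip,
        # providing global context for head movement.
        self.long_branch = nn.GRU(n_mels, hidden, batch_first=True)
        # Fusion head: per-frame local features + broadcast global summary
        # -> per-frame head-pose parameters (e.g. rotation + translation).
        self.pose_head = nn.Linear(hidden * 2, pose_dim)

    def forward(self, mel):                       # mel: (B, T, n_mels)
        local = self.short_branch(mel.transpose(1, 2)).transpose(1, 2)   # (B, T, H)
        _, global_h = self.long_branch(mel)       # global_h: (1, B, H)
        global_feat = global_h[-1].unsqueeze(1).expand(-1, mel.size(1), -1)
        fused = torch.cat([local, global_feat], dim=-1)                  # (B, T, 2H)
        return self.pose_head(fused)              # (B, T, pose_dim)

# Usage: predict a head-pose sequence from a batch of mel spectrograms.
poses = LongShortTimeAudioEncoder()(torch.randn(2, 100, 80))
print(poses.shape)  # torch.Size([2, 100, 6])
```

In such a design, the predicted pose sequence would condition the encoder-decoder generator alongside identity and mouth-shape features; the abstract does not specify that interface, so it is omitted here.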
ISSN: 1370-4621; 1573-773X
DOI: 10.1007/s11063-023-11272-7