Expressive Talking Head Generation with Granular Audio-Visual Control

Bibliographic Details
Published in: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3377-3386
Main Authors: Liang, Borong; Pan, Yan; Guo, Zhizhi; Zhou, Hang; Hong, Zhibin; Han, Xiaoguang; Han, Junyu; Liu, Jingtuo; Ding, Errui; Wang, Jingdong
Format: Conference Proceeding
Language: English
Published: IEEE, 01.06.2022
Summary: Generating expressive talking heads is essential for creating virtual humans. However, existing one- or few-shot methods focus on lip-sync and head motion, ignoring the emotional expressions that make talking faces realistic. In this paper, we propose Granularly Controlled Audio-Visual Talking Heads (GC-AVT), which controls the lip movements, head pose, and facial expressions of a talking head in a granular manner. Our insight is to decouple the audio-visual driving sources through prior-based pre-processing designs. Specifically, we disassemble the driving image into three complementary parts: 1) a cropped mouth that facilitates lip-sync; 2) a masked head that implicitly learns pose; and 3) the upper face, which works jointly and complementarily with a time-shifted mouth to contribute the expression. The encoded features from the three sources are then balanced through reconstruction training. Extensive experiments show that our method generates expressive faces with not only synced mouth shapes and controllable poses but also precisely animated emotional expressions.
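The three-part disassembly described in the summary can be illustrated with a minimal sketch. The exact crop boxes, masking scheme, and landmark source are not specified in this record, so the region coordinates below (`mouth_box`, `eye_line`) and the zero-masking choice are illustrative assumptions, not the paper's actual pre-processing:

```python
import numpy as np

def disassemble_driving_image(img, mouth_box, eye_line):
    """Split a driving face image into three complementary parts
    (cropped mouth, masked head, upper face), in the spirit of the
    GC-AVT pre-processing. Coordinates and masking are assumptions."""
    x0, y0, x1, y1 = mouth_box

    # 1) Cropped mouth: isolates the lip region used for lip-sync.
    mouth_crop = img[y0:y1, x0:x1].copy()

    # 2) Masked head: hide the face interior so an encoder can only
    #    pick up head pose, not appearance or expression.
    #    (Assumption: zero out everything below an eye-level row.)
    masked_head = img.copy()
    masked_head[eye_line:, :] = 0

    # 3) Upper face: rows above the eye line carry the expression cues
    #    that complement a time-shifted mouth crop.
    upper_face = img[:eye_line, :].copy()

    return mouth_crop, masked_head, upper_face
```

In the actual method, each part would feed a separate encoder, and the three feature streams are balanced during reconstruction training; this sketch only shows the input decomposition.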
ISSN: 2575-7075
DOI: 10.1109/CVPR52688.2022.00338