Expressive Talking Head Generation with Granular Audio-Visual Control
Generating expressive talking heads is essential for creating virtual humans. However, existing one- or few-shot methods focus on lip-sync and head motion, ignoring the emotional expressions that make talking faces realistic. In this paper, we propose the Granularly Controlled Audio-Visual Talking H...
Saved in:
Published in | 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 3377 - 3386 |
---|---|
Main Authors | , , , , , , , , , |
Format | Conference Proceeding |
Language | English |
Published |
IEEE
01.06.2022
|
Subjects | |
Online Access | Get full text |
Cover
Loading…
Summary: | Generating expressive talking heads is essential for creating virtual humans. However, existing one- or few-shot methods focus on lip-sync and head motion, ignoring the emotional expressions that make talking faces realistic. In this paper, we propose the Granularly Controlled Audio-Visual Talking Heads (GC-AVT), which controls lip movements, head poses, and facial expressions of a talking head in a granular manner. Our insight is to decouple the audio-visual driving sources through prior-based pre-processing designs. Detailedly, we disassemble the driving image into three complementary parts including: 1) a cropped mouth that facilitates lip-sync; 2) a masked head that implicitly learns pose; and 3) the upper face which works corporately and complementarily with a time-shifted mouth to contribute the expression. Interestingly, the encoded features from the three sources are integrally balanced through reconstruction training. Extensive experiments show that our method generates expressive faces with not only synced mouth shapes, controllable poses, but precisely animated emotional expressions as well. |
---|---|
ISSN: | 2575-7075 |
DOI: | 10.1109/CVPR52688.2022.00338 |