Disentangling audio content and emotion with adaptive instance normalization for expressive facial animation synthesis

Bibliographic Details
Published in: Computer Animation and Virtual Worlds, Vol. 33, no. 3-4
Main Authors: Chang, Che‐Jui; Zhao, Long; Zhang, Sen; Kapadia, Mubbasir
Format: Journal Article
Language: English
Published: Chichester: Wiley Subscription Services, Inc., 01.06.2022
Summary: 3D facial animation synthesis from audio has been a research focus in recent years. However, most existing works are designed to map audio to visual content, offering limited insight into the relationship between emotion in audio and expressive facial animation. This work generates audio‐matching facial animations conditioned on a specified emotion label. For such a task, we argue that separating content from audio is indispensable: the proposed model must learn to generate facial content from the audio content while deriving expressions from the specified emotion. We achieve this with an adaptive instance normalization module that isolates the content in the audio and combines it with the emotion embedding from the specified label. The joint content‐emotion embedding is then used to generate 3D facial vertices and texture maps. We compare our method with state‐of‐the‐art baselines, including facial segmentation‐based and voice conversion‐based disentanglement approaches, and conduct a user study to evaluate the performance of emotion conditioning. The results indicate that our proposed method outperforms the baselines in animation quality and expression categorization accuracy. In summary, we apply adaptive instance normalization for emotion conditioning in expressive facial animation synthesis. Our model successfully disentangles audio content from emotion and re-entangles the content with the specified emotion, and our experiments show that it outperforms three competitive baselines.
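
The abstract does not give implementation details. As a rough illustration of how adaptive instance normalization could combine audio-content features with an emotion embedding in the way described above, the following PyTorch sketch normalizes the content features and re-scales them with statistics predicted from the emotion embedding; all names, dimensions, and the single linear projection are illustrative assumptions, not the authors' code.

import torch
import torch.nn as nn

class AdaIN(nn.Module):
    # Adaptive instance normalization: strips the per-channel statistics of the
    # content features and re-injects a scale/shift predicted from an emotion
    # embedding. Dimensions and the linear projection are illustrative assumptions.
    def __init__(self, content_dim, emotion_dim):
        super().__init__()
        self.norm = nn.InstanceNorm1d(content_dim, affine=False)
        self.to_scale_shift = nn.Linear(emotion_dim, 2 * content_dim)

    def forward(self, content, emotion):
        # content: (batch, content_dim, frames) audio-content features
        # emotion: (batch, emotion_dim) embedding of the specified emotion label
        gamma, beta = self.to_scale_shift(emotion).chunk(2, dim=-1)
        normalized = self.norm(content)
        # Broadcast the emotion-dependent scale and shift over the time axis
        return gamma.unsqueeze(-1) * normalized + beta.unsqueeze(-1)

# Hypothetical usage: the joint content-emotion features would then feed a decoder
# that predicts 3D facial vertices and texture maps.
adain = AdaIN(content_dim=256, emotion_dim=64)
content_features = torch.randn(8, 256, 100)   # e.g., per-frame audio-content features
emotion_embedding = torch.randn(8, 64)        # e.g., learned embedding of an emotion label
joint_features = adain(content_features, emotion_embedding)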
ISSN: 1546-4261, 1546-427X
DOI: 10.1002/cav.2076