Emotion-Aware Talking Face Generation Based on 3DMM
Published in | 2024 4th International Conference on Neural Networks, Information and Communication (NNICE), pp. 1808 - 1813 |
---|---|
Main Authors | |
Format | Conference Proceeding |
Language | English |
Published | IEEE, 19.01.2024 |
Summary: | Current deep-learning methods for generating talking face videos focus mainly on the correlation between lip movements and audio content. Although these methods achieve high generation quality and good audio-visual synchronization, they ignore the facial expressions present in talking face videos. To address this problem, this paper proposes the Audio to Expression Network (A2ENet), an emotional talking face video generation framework based on 3DMM, which generates talking face videos with facial expressions in an audio-driven way. A2ENet first extracts audio features with two Transformer-based encoders and applies a cross-reconstruction emotion disentanglement method to decompose the audio into a content latent space and an emotion latent space, which are then fused by a Transformer decoder. The proposed method then predicts 3D expression coefficients that match the emotion of the audio, and finally uses a renderer to generate the talking face video. Through eye control parameters, A2ENet can control the eye movements of the talking face, and by associating the initial 3D expression coefficients with specific individuals it retains the identity information of the reference face. Experimental results show that the method generates talking face videos with appropriate facial expressions while achieving more accurate lip movements and better video quality. |
---|---|
DOI: | 10.1109/NNICE61279.2024.10498924 |
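The abstract describes a pipeline in which two Transformer-based encoders separate the audio features into content and emotion latent spaces, a Transformer decoder fuses them, and the network regresses 3DMM expression coefficients (plus eye control parameters) that a renderer turns into video frames. Below is a minimal PyTorch sketch of that high-level structure only; the module names, dimensions, offset-from-reference design, and the choice to use the decoder with content as target and emotion as memory are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of an A2ENet-style audio-to-expression pipeline, as outlined
# in the abstract. All names, dimensions, and layer choices are assumptions.
import torch
import torch.nn as nn


class AudioToExpression(nn.Module):
    def __init__(self, audio_dim=80, model_dim=256, n_exp_coeffs=64, n_eye_params=2):
        super().__init__()
        self.proj = nn.Linear(audio_dim, model_dim)
        # Two Transformer-based encoders: one for content, one for emotion.
        self.content_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(model_dim, nhead=4, batch_first=True), num_layers=2)
        self.emotion_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(model_dim, nhead=4, batch_first=True), num_layers=2)
        # Transformer decoder fuses the two latent spaces (content as target,
        # emotion as memory -- an assumed arrangement).
        self.fusion_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(model_dim, nhead=4, batch_first=True), num_layers=2)
        # Heads for per-frame 3DMM expression coefficients and eye-control parameters.
        self.exp_head = nn.Linear(model_dim, n_exp_coeffs)
        self.eye_head = nn.Linear(model_dim, n_eye_params)

    def forward(self, audio_feats, init_exp):
        # audio_feats: (batch, frames, audio_dim), e.g. mel-spectrogram frames
        # init_exp:    (batch, n_exp_coeffs) initial coefficients of the reference face
        x = self.proj(audio_feats)
        content = self.content_encoder(x)
        emotion = self.emotion_encoder(x)
        fused = self.fusion_decoder(tgt=content, memory=emotion)
        # Predict offsets around the reference coefficients to keep identity.
        exp = init_exp.unsqueeze(1) + self.exp_head(fused)
        eyes = torch.sigmoid(self.eye_head(fused))
        return exp, eyes  # passed to a 3DMM renderer to produce video frames


if __name__ == "__main__":
    model = AudioToExpression()
    audio = torch.randn(1, 100, 80)   # 100 audio frames
    init_exp = torch.zeros(1, 64)     # reference-face expression coefficients
    exp, eyes = model(audio, init_exp)
    print(exp.shape, eyes.shape)      # (1, 100, 64) and (1, 100, 2)
```

The cross-reconstruction emotion disentanglement loss and the renderer are omitted here; the sketch only shows how the content and emotion branches could feed a shared decoder that regresses expression coefficients relative to the reference face.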