Emotion-Aware Talking Face Generation Based on 3DMM
Published in | 2024 4th International Conference on Neural Networks, Information and Communication (NNICE), pp. 1808 - 1813 |
---|---|
Main Authors | |
Format | Conference Proceeding |
Language | English |
Published | IEEE, 19.01.2024 |
Summary: | Current deep-learning methods for generating talking face videos focus mainly on the correlation between lip movements and audio content. Although these methods achieve high generation quality and good audio-visual synchronization, they ignore the facial expressions present in talking face videos. To address this problem, this paper proposes the Audio to Expression Network (A2ENet), an emotional talking face video generation framework based on 3DMM, which generates talking face videos with facial expressions in an audio-driven way. A2ENet first extracts audio features with two Transformer-based encoders and applies a cross-reconstruction emotion disentanglement method to decompose the audio into a content latent space and an emotion latent space, which are then fused by a Transformer decoder. The proposed method then predicts 3D expression coefficients that match the emotion of the audio, and finally uses a renderer to generate the talking face video. Through eye control parameters, A2ENet can control the eye movements of the talking face, and by associating the initial 3D expression coefficients with specific individuals it retains the identity information of the reference face. Experimental results show that the method generates talking face videos with appropriate facial expressions while achieving more accurate lip movements and better video quality. |
---|---|
DOI: | 10.1109/NNICE61279.2024.10498924 |
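The abstract describes a pipeline in which two Transformer-based encoders separate the audio features into content and emotion latent spaces, a Transformer decoder fuses them, and the network regresses 3DMM expression coefficients (plus eye control parameters) that a renderer turns into video frames. Below is a minimal PyTorch sketch of that high-level structure only; the module names, dimensions, offset-from-reference design, and the choice to use the decoder with content as target and emotion as memory are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of an A2ENet-style audio-to-expression pipeline, as outlined
# in the abstract. All names, dimensions, and layer choices are assumptions.
import torch
import torch.nn as nn


class AudioToExpression(nn.Module):
    def __init__(self, audio_dim=80, model_dim=256, n_exp_coeffs=64, n_eye_params=2):
        super().__init__()
        self.proj = nn.Linear(audio_dim, model_dim)
        # Two Transformer-based encoders: one for content, one for emotion.
        self.content_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(model_dim, nhead=4, batch_first=True), num_layers=2)
        self.emotion_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(model_dim, nhead=4, batch_first=True), num_layers=2)
        # Transformer decoder fuses the two latent spaces (content as target,
        # emotion as memory -- an assumed arrangement).
        self.fusion_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(model_dim, nhead=4, batch_first=True), num_layers=2)
        # Heads for per-frame 3DMM expression coefficients and eye-control parameters.
        self.exp_head = nn.Linear(model_dim, n_exp_coeffs)
        self.eye_head = nn.Linear(model_dim, n_eye_params)

    def forward(self, audio_feats, init_exp):
        # audio_feats: (batch, frames, audio_dim), e.g. mel-spectrogram frames
        # init_exp:    (batch, n_exp_coeffs) initial coefficients of the reference face
        x = self.proj(audio_feats)
        content = self.content_encoder(x)
        emotion = self.emotion_encoder(x)
        fused = self.fusion_decoder(tgt=content, memory=emotion)
        # Predict offsets around the reference coefficients to keep identity.
        exp = init_exp.unsqueeze(1) + self.exp_head(fused)
        eyes = torch.sigmoid(self.eye_head(fused))
        return exp, eyes  # passed to a 3DMM renderer to produce video frames


if __name__ == "__main__":
    model = AudioToExpression()
    audio = torch.randn(1, 100, 80)   # 100 audio frames
    init_exp = torch.zeros(1, 64)     # reference-face expression coefficients
    exp, eyes = model(audio, init_exp)
    print(exp.shape, eyes.shape)      # (1, 100, 64) and (1, 100, 2)
```

The cross-reconstruction emotion disentanglement loss and the renderer are omitted here; the sketch only shows how the content and emotion branches could feed a shared decoder that regresses expression coefficients relative to the reference face.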