Emotion-Aware Talking Face Generation Based on 3DMM

Bibliographic Details
Published in: 2024 4th International Conference on Neural Networks, Information and Communication (NNICE), pp. 1808-1813
Main Authors: Chen, Xinyu; Tang, Sheng
Format: Conference Proceeding
Language: English
Published: IEEE, 19.01.2024
Subjects: 3DMM; Aerospace electronics; content information; Deep learning; emotion information; facial expressions; Feature extraction; lip movements; Lips; Quality assessment; Three-dimensional displays; Transformer; Transformers
Online Access: https://ieeexplore.ieee.org/document/10498924
DOI: 10.1109/NNICE61279.2024.10498924

Abstract: Current methods for generating talking face videos based on deep learning mainly focus on the correlation between lip movements and audio content. Although these methods achieve high generation quality and good audio-visual synchronization, they ignore facial expressions in talking face videos. To solve this problem, this paper proposes the Audio to Expression Network (A2ENet), an emotional talking face video generation framework based on 3DMM, which generates talking face videos with facial expressions in an audio-driven way. Firstly, A2ENet uses two Transformer-based encoders to extract audio features and applies a cross-reconstruction emotion disentanglement method to decompose the audio into a latent space of content information and a latent space of emotion information; a Transformer decoder then integrates these two feature spaces. After that, the proposed method predicts the 3D expression coefficients that match the emotion of the audio, and finally uses a renderer to generate the talking face video. By using eye control parameters, A2ENet can control the eye movements of the talking face. A2ENet associates the initial 3D expression coefficients with specific individuals to retain the identity information of the reference face. Experimental results show that our method can generate talking face videos with appropriate facial expressions, and achieve more accurate lip movements and better video quality.
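
The abstract outlines an audio-driven pipeline: two Transformer-based encoders extract content and emotion features from the audio, a Transformer decoder integrates the two feature spaces, and the fused representation is mapped to 3DMM expression coefficients that a renderer turns into video frames. The sketch below shows one plausible way such an encoder/decoder arrangement could be wired up in PyTorch; the module names, layer counts, feature dimensions, and the 64-coefficient output size are illustrative assumptions rather than details from the paper, and the cross-reconstruction disentanglement training, eye control parameters, and renderer are omitted.

```python
# Minimal sketch of the two-encoder / one-decoder arrangement described in the
# abstract. All names, dimensions, and layer counts are illustrative
# assumptions, not the paper's actual configuration.
import torch
import torch.nn as nn

class A2ENetSketch(nn.Module):
    def __init__(self, audio_dim=80, d_model=256, n_exp_coeffs=64):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        # Two Transformer-based encoders: one for content, one for emotion
        # (TransformerEncoder deep-copies the layer, so the weights are separate).
        self.content_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.emotion_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        # Transformer decoder that integrates the content and emotion feature spaces.
        self.fusion_decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        # Regression head for per-frame 3DMM expression coefficients.
        self.exp_head = nn.Linear(d_model, n_exp_coeffs)

    def forward(self, audio_feats):
        # audio_feats: (batch, frames, audio_dim), e.g. mel-spectrogram frames.
        x = self.audio_proj(audio_feats)
        content = self.content_encoder(x)
        emotion = self.emotion_encoder(x)
        # Content features act as queries that attend over the emotion features.
        fused = self.fusion_decoder(tgt=content, memory=emotion)
        return self.exp_head(fused)  # (batch, frames, n_exp_coeffs)

# Example: a 2-second clip at 25 fps -> 50 frames of 80-dim audio features.
coeffs = A2ENetSketch()(torch.randn(1, 50, 80))
print(coeffs.shape)  # torch.Size([1, 50, 64])
```
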
Author Details
– Chen, Xinyu (cxy808@gs.zzu.edu.cn), Zhengzhou University, Henan Institute of Advanced Technology, Zhengzhou, China
– Tang, Sheng (ts@ict.ac.cn), Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
EISBN: 9798350394375