Emotion-Aware Talking Face Generation Based on 3DMM
| Published in | 2024 4th International Conference on Neural Networks, Information and Communication (NNICE), pp. 1808-1813 (6 pages) |
|---|---|
| Main Authors | Chen, Xinyu; Tang, Sheng |
| Format | Conference Proceeding |
| Language | English |
| Publisher | IEEE |
| Published | 19.01.2024 |
| EISBN | 9798350394375 |
| Subjects | 3DMM; Aerospace electronics; content information; Deep learning; emotion information; facial expressions; Feature extraction; lip movements; Lips; Quality assessment; Three-dimensional displays; Transformer; Transformers |
| DOI | 10.1109/NNICE61279.2024.10498924 |
| Online Access | https://ieeexplore.ieee.org/document/10498924 |
Abstract

Current methods for generating talking face videos based on deep learning mainly focus on the correlation between lip movements and audio content. Although these methods achieve high generation quality and good audio-visual synchronization, they ignore facial expressions in talking face videos. To solve this problem, this paper proposes the Audio to Expression Network (A2ENet), an emotional talking face video generation framework based on 3DMM that generates talking face videos with facial expressions in an audio-driven way. First, A2ENet uses two Transformer-based encoders to extract audio features and a cross-reconstruction emotion disentanglement method to decompose the audio into a latent space of content information and a latent space of emotion information; a Transformer decoder then integrates the two feature spaces. The proposed method next predicts the 3D expression coefficients that match the emotion of the audio, and finally uses a renderer to generate the talking face video. Through eye control parameters, A2ENet can control the eye movements of the talking face, and it associates the initial 3D expression coefficients with specific individuals to retain the identity information of the reference face. Experimental results show that the method generates talking face videos with appropriate facial expressions while achieving more accurate lip movements and better video quality.
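The record's abstract describes a concrete pipeline (two Transformer encoders, a fusion decoder, and cross-reconstruction disentanglement), so the following is a minimal PyTorch sketch of that structure. It is an assumption-based illustration, not the authors' implementation: the module names, layer counts, feature dimensions, and the 64-coefficient 3DMM expression head are all hypothetical, and the renderer, eye control parameters, and identity-specific initialization are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class A2ENetSketch(nn.Module):
    """Hypothetical sketch of the A2ENet pipeline described in the abstract."""

    def __init__(self, audio_dim=80, d_model=256, n_exp_coeffs=64):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        # Two Transformer-based encoders producing the content latent space and
        # the emotion latent space (nn.TransformerEncoder deep-copies the layer
        # template, so the two encoders do not share weights).
        self.content_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.emotion_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # A Transformer decoder integrates the two feature spaces.
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.fusion_decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        # Head predicting per-frame 3DMM expression coefficients for a renderer;
        # the coefficient count (64) is an assumption, not from the paper.
        self.to_exp_coeffs = nn.Linear(d_model, n_exp_coeffs)

    def forward(self, audio_feats):
        # audio_feats: (batch, frames, audio_dim), e.g. mel-spectrogram frames.
        x = self.audio_proj(audio_feats)
        content = self.content_encoder(x)
        emotion = self.emotion_encoder(x)
        fused = self.fusion_decoder(tgt=content, memory=emotion)
        return self.to_exp_coeffs(fused)  # (batch, frames, n_exp_coeffs)


def cross_reconstruction_loss(model, audio_neutral, audio_emotional,
                              coeffs_neutral, coeffs_emotional):
    """One plausible form of cross-reconstruction disentanglement: the two
    clips share spoken content but differ in emotion, so swapping their
    emotion latents should still reconstruct the matching targets."""
    xa = model.audio_proj(audio_neutral)
    xb = model.audio_proj(audio_emotional)
    ca, ea = model.content_encoder(xa), model.emotion_encoder(xa)
    cb, eb = model.content_encoder(xb), model.emotion_encoder(xb)
    # Content of clip A + emotion of clip B should yield the emotional
    # coefficients, and vice versa; this only works if the latents disentangle.
    pred_emotional = model.to_exp_coeffs(model.fusion_decoder(tgt=ca, memory=eb))
    pred_neutral = model.to_exp_coeffs(model.fusion_decoder(tgt=cb, memory=ea))
    return (F.mse_loss(pred_emotional, coeffs_emotional)
            + F.mse_loss(pred_neutral, coeffs_neutral))
```

Feeding a `(batch, frames, 80)` mel tensor through `A2ENetSketch` yields per-frame expression coefficients that a 3DMM renderer could consume; how the paper actually parameterizes audio features, latent swaps, or losses is not specified in this record.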
Authors

Chen, Xinyu (cxy808@gs.zzu.edu.cn), Zhengzhou University, Henan Institute of Advanced Technology, Zhengzhou, China
Tang, Sheng (ts@ict.ac.cn), Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
BookMark | eNo1j81OwkAUhcdEFoq8AYt5gdZ756czd4m1IAngBtfkQu-YRmhNITG-vTXK6jvJSb6cc69u264VpTRCjgj0uNksy6pAEyg3YFyO4CiScTdqQoGi9WDJ2eDvlK1O3aXp2mz2xb3oLR8_mvZdz_kgeiGt9Pzb6ic-S62HYJ_X6wc1Snw8y-SfY_U2r7blS7Z6XSzL2SprEOmS7QmYOdraJgy-cABSGIrDLAcDaycJCX0KPiZGjrgHsIHcwZuCMXk7VtM_byMiu8--OXH_vbt-sT9oF0Dz |
ContentType | Conference Proceeding |
DBID | 6IE 6IL CBEJK RIE RIL |
DOI | 10.1109/NNICE61279.2024.10498924 |
DatabaseName | IEEE Electronic Library (IEL) Conference Proceedings IEEE Xplore POP ALL IEEE Xplore All Conference Proceedings IEEE Electronic Library (IEL) IEEE Proceedings Order Plans (POP All) 1998-Present |
DatabaseTitleList | |
Database_xml | – sequence: 1 dbid: RIE name: IEEE Electronic Library (IEL) url: https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/ sourceTypes: Publisher |
DeliveryMethod | fulltext_linktorsrc |
EISBN | 9798350394375 |
EndPage | 1813 |
ExternalDocumentID | 10498924 |
Genre | orig-research |
GroupedDBID | 6IE 6IL CBEJK RIE RIL |
ID | FETCH-LOGICAL-i119t-b90aaa83d3f1756400e629861240298d4ef1915f758fa1a81b003794c526a1f53 |
IEDL.DBID | RIE |
IngestDate | Wed May 01 11:58:52 EDT 2024 |
IsPeerReviewed | false |
IsScholarly | false |
Language | English |
LinkModel | DirectLink |
MergedId | FETCHMERGED-LOGICAL-i119t-b90aaa83d3f1756400e629861240298d4ef1915f758fa1a81b003794c526a1f53 |
PageCount | 6 |
ParticipantIDs | ieee_primary_10498924 |
PublicationCentury | 2000 |
PublicationDate | 2024-Jan.-19 |
PublicationDateYYYYMMDD | 2024-01-19 |
PublicationDate_xml | – month: 01 year: 2024 text: 2024-Jan.-19 day: 19 |
PublicationDecade | 2020 |
PublicationTitle | 2024 4th International Conference on Neural Networks, Information and Communication (NNICE) |
PublicationTitleAbbrev | NNICE |
PublicationYear | 2024 |
Publisher | IEEE |
Publisher_xml | – name: IEEE |
Score | 1.8678215 |
Snippet | Current methods for generating videos of talking face based on deep learning mainly focus on the correlation between lip movements and audio content. Although... |
SourceID | ieee |
SourceType | Publisher |
StartPage | 1808 |
SubjectTerms | 3DMM Aerospace electronics content information Deep learning emotion information facial expressions Feature extraction lip movements Lips Quality assessment Three-dimensional displays Transformer Transformers |
Title | Emotion-Aware Talking Face Generation Based on 3DMM |
URI | https://ieeexplore.ieee.org/document/10498924 |
hasFullText | 1 |
inHoldings | 1 |
isFullTextHit | |
isPrint | |
link | http://utb.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMwjV1LS8NAEB5sT55UjPgmB6-JeWy2u0etDVVo8NBCb2WzDxAllZJQ8Nc7kzSKguBt2SxkZ8PyzUzm-wbgRjouhaPqBkSTgBGJS5jEBSJV6HxjSBGrVu2z4NMFe1pmyx1ZveXCWGvb4jMb0rD9l2_WuqFUGd5wJgUGDAMYYOTWkbX66pxI3hbF43iCiD0iAkrCwn75j8YpLW7kB1D0b-zKRV7Dpi5D_fFLjPHfWzoE75ui5z9_gc8R7NnqGNJJ15QnuNuqjfXn6o0S4X6ucHGnL01P_XtELuPjIH2YzTxY5JP5eBrsuiIEL3Es66CUkVJKpCZ1CP0c76DliRRoNyM5dcOswxgscxgIOBUrdEtJY0YynSVcxS5LT2BYrSt7Cn5kdKYk15qTjlkWlS7RmiWlsSNydNgZeGTx6r0Tvlj1xp7_MX8B-3TwlKGI5SUM601jrxCz6_K6_VafCdiTcw |
linkProvider | IEEE |
linkToHtml | http://utb.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMwjV1LS8NAEB60HvSkYsW3OXhNzGOzzR61tqTaBA8t9FY2-wBR2lISBH-9M0mjKAjelmTDZliWb2Z2vm8AboTlIrFU3YBo4jIicSU6tG4SSXS-MaQIZK32mfN0yh5n8WxDVq-5MMaYuvjMeDSs7_L1UlWUKsMTzkSCAcM27CDwM9HQtdr6HF_c5vmoP0DM7hEFJWRe-8GP1ik1cgz3IW_XbApGXr2qLDz18UuO8d8_dQDdb5Ke8_wFP4ewZRZHEA2atjzu3btcG2ci3ygV7gwlTm4Upumtc4_YpR0cRA9Z1oXpcDDpp-6mL4L7EgSidAvhSymTSEcWwZ_jKTQ8FAnazUhQXTNjMQqLLYYCVgYSHVNSmRFMxSGXgY2jY-gslgtzAo6vVSwFV4qTklnsFzZUioWFNj1yddgpdMni-aqRvpi3xp798fwadtNJNp6PR_nTOezRJlC-IhAX0CnXlblEBC-Lq3rfPgH-KZbD |
openUrl | ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook&rft.genre=proceeding&rft.title=2024+4th+International+Conference+on+Neural+Networks%2C+Information+and+Communication+%28NNICE%29&rft.atitle=Emotion-Aware+Talking+Face+Generation+Based+on+3DMM&rft.au=Chen%2C+Xinyu&rft.au=Tang%2C+Sheng&rft.date=2024-01-19&rft.pub=IEEE&rft.spage=1808&rft.epage=1813&rft_id=info:doi/10.1109%2FNNICE61279.2024.10498924&rft.externalDocID=10498924 |