Emotional 3D speech visualization from 2D audio visual data

Bibliographic Details
Published in: International journal of modeling, simulation and scientific computing, Vol. 14, No. 5
Main Authors: Guillermo, Luis; Rojas, Jose-Maria; Ugarte, Willy
Format: Journal Article
Language: English
Published: Hackensack: World Scientific Publishing Company (World Scientific Publishing Co. Pte., Ltd), 01.10.2023
Online Access: Get full text
ISSN: 1793-9623
EISSN: 1793-9615
DOI: 10.1142/S1793962324500028

Abstract Visual speech is hard to recreate by hand because animation is a time-consuming task: both precision and detail must be considered and must match the expectations of the developers and, above all, of the audience. To address this problem, several approaches have been designed to help accelerate the animation of characters' faces, such as procedural animation or speech-lip synchronization; the most common research areas for these methods are Computer Vision and Machine Learning. In general, however, these tools suffer from one or more of the following problems: difficulty adapting to another language, subject, or animation software; high hardware requirements; or results that are perceived as robotic. Our work presents a Deep Learning model for automatic expressive facial animation driven by audio. We extract generic audio features from expressive, phoneme-rich speech recordings for language-independent speech processing and emotion recognition. From the training videos, we extract facial landmarks to align frames with speech so that the model learns the animation of phoneme pronunciation. We evaluated four variants of our model (two loss functions, each with and without emotion conditioning) through a user-perception survey; the variant trained with a reconstruction loss and emotion conditioning produced the most natural results and the best synchronization scores, with the approval of the majority of interviewees. For perceived naturalness it obtained 38.89% of the total approval votes, and for language synchronization it obtained the highest average score, 65.55% (98.33 of 150 total points), across English, German, and Korean.
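The abstract outlines a pipeline of audio feature extraction, facial-landmark targets taken from video frames, and a sequence model trained with a reconstruction loss and optional emotion conditioning. The record does not include the authors' implementation, so the following is only a minimal sketch of that kind of pipeline; the MFCC feature count, landmark count, emotion classes, and LSTM architecture are assumptions for illustration, not details from the paper.

```python
import librosa
import torch
import torch.nn as nn

# Hypothetical constants; the paper's actual feature set and landmark count are not given here.
N_MFCC = 26          # generic audio features per frame
N_LANDMARKS = 68     # 2D facial landmarks per video frame
N_EMOTIONS = 6       # size of a one-hot emotion conditioning vector


def audio_features(wav_path: str, fps: float = 25.0) -> torch.Tensor:
    """Extract per-video-frame MFCC features so audio frames align with landmark frames."""
    y, sr = librosa.load(wav_path, sr=16000)
    hop = int(sr / fps)  # one feature vector per video frame
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=N_MFCC, hop_length=hop)
    return torch.from_numpy(mfcc.T).float()  # shape: (frames, N_MFCC)


class AudioToLandmarks(nn.Module):
    """Sequence model mapping audio features (plus an emotion code) to landmark coordinates."""

    def __init__(self):
        super().__init__()
        self.rnn = nn.LSTM(N_MFCC + N_EMOTIONS, 128, batch_first=True)
        self.head = nn.Linear(128, N_LANDMARKS * 2)  # x, y per landmark

    def forward(self, feats: torch.Tensor, emotion: torch.Tensor) -> torch.Tensor:
        # Broadcast the clip-level emotion one-hot over every time step.
        cond = emotion.unsqueeze(1).expand(-1, feats.size(1), -1)
        out, _ = self.rnn(torch.cat([feats, cond], dim=-1))
        return self.head(out)  # shape: (batch, frames, N_LANDMARKS * 2)


# Reconstruction loss: mean squared error between predicted and ground-truth landmarks.
model = AudioToLandmarks()
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

feats = torch.randn(1, 100, N_MFCC)            # stand-in for audio_features(...)
emotion = torch.eye(N_EMOTIONS)[[0]]           # e.g. "neutral"
target = torch.randn(1, 100, N_LANDMARKS * 2)  # landmarks extracted from the video frames

optimizer.zero_grad()
pred = model(feats, emotion)
loss = criterion(pred, target)
loss.backward()
optimizer.step()
```

In practice the landmark targets would come from a face-alignment detector run on the training videos, and the predicted landmark sequences would then drive a 3D face rig; both steps are outside this sketch.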
Copyright: 2023, World Scientific Publishing Company
Discipline: Computer Science
Keywords: audio-visual speech; speech animation; procedural animation
ORCID: 0000-0002-7510-618X
Subject Terms: Animation
Audio data
Computer vision
Deep learning
Emotion recognition
Emotions
Feature extraction
Image reconstruction
Machine learning
Phonemes
Research Article
Software
Speech processing
Speech recognition
Synchronism
Visual tasks
URI: http://www.worldscientific.com/doi/abs/10.1142/S1793962324500028
https://www.proquest.com/docview/3165613273