Emotional 3D speech visualization from 2D audio visual data
Visual speech is hard to recreate by hand because animation itself is a time-consuming task: both precision and detail must be considered and must match the expectations of the developers and, above all, those of the audience. To address this problem, several approaches have been designed to help accelerate the animation of characters' faces, such as procedural animation or speech-lip synchronization, with Computer Vision and Machine Learning being the most common research areas for these methods. In general, however, these tools tend to suffer from at least one of several problems: difficulty adapting to another language, subject, or animation software; high hardware requirements; or results that are perceived as robotic. Our work presents a Deep Learning model for automatic expressive facial animation driven by audio. We extract generic audio features from phoneme-rich expressive speech recordings for language-independent speech processing and emotion recognition. From the training videos, we extract facial landmarks for frame-to-speech alignment, from which the model learns the animation of phoneme pronunciation. We evaluated four variants of our model (two loss functions, each with and without emotion conditioning) through a user-perception survey; the variant trained with a reconstruction loss and emotion conditioning produced the most natural results and the best synchronization scores, with the approval of the majority of interviewees. For perceived naturalness it obtained 38.89% of the total approval votes, and for language synchronization it obtained the highest average score, 65.55% (98.33 out of 150 points), across English, German, and Korean.
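The abstract describes a pipeline that maps per-frame audio features to animated facial landmarks, conditioned on an emotion label and trained with a reconstruction loss. The sketch below is only an illustration of how such a model can be wired, not the authors' implementation: the LSTM backbone, the feature and landmark dimensions, and all names are assumptions chosen for the example.

```python
# Illustrative sketch only, not the authors' code: an LSTM maps per-frame audio
# features to 2D facial landmarks, conditioned on an emotion label, and is
# trained with a reconstruction (MSE) loss. Dimensions and names are assumptions.
import torch
import torch.nn as nn

class Audio2Landmarks(nn.Module):
    def __init__(self, n_audio_feats=26, n_emotions=7, emb_dim=16,
                 hidden=128, n_landmarks=68):
        super().__init__()
        self.emotion_emb = nn.Embedding(n_emotions, emb_dim)    # emotion conditioning
        self.rnn = nn.LSTM(n_audio_feats + emb_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_landmarks * 2)          # (x, y) per landmark

    def forward(self, audio_feats, emotion_id):
        # audio_feats: (batch, frames, n_audio_feats); emotion_id: (batch,)
        emo = self.emotion_emb(emotion_id).unsqueeze(1)
        emo = emo.expand(-1, audio_feats.size(1), -1)            # repeat the label per frame
        out, _ = self.rnn(torch.cat([audio_feats, emo], dim=-1))
        return self.head(out)                                    # (batch, frames, n_landmarks * 2)

# One dummy training step with random tensors, just to show the loss wiring.
model = Audio2Landmarks()
criterion = nn.MSELoss()                                         # reconstruction loss on landmarks
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

audio = torch.randn(4, 100, 26)       # 4 clips, 100 frames of audio features each
emotion = torch.randint(0, 7, (4,))   # one emotion label per clip
target = torch.randn(4, 100, 68 * 2)  # ground-truth landmark coordinates per frame

loss = criterion(model(audio, emotion), target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```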
Published in | International Journal of Modeling, Simulation and Scientific Computing, Vol. 14, No. 5 |
---|---|
Main Authors | Guillermo, Luis; Rojas, Jose-Maria; Ugarte, Willy |
Format | Journal Article |
Language | English |
Published | Hackensack: World Scientific Publishing Company (World Scientific Publishing Co. Pte., Ltd), 01.10.2023 |
Subjects | Animation; Audio data; Computer vision; Deep learning; Emotion recognition; Emotions; Feature extraction; Image reconstruction; Machine learning; Phonemes; Software; Speech processing; Speech recognition; Synchronism; Visual tasks |
ISSN | 1793-9623 (print); 1793-9615 (electronic) |
DOI | 10.1142/S1793962324500028 |
Copyright | 2023, World Scientific Publishing Company |
Keywords | audio-visual speech; speech animation; procedural animation |