Emotional 3D speech visualization from 2D audio visual data

Bibliographic Details
Published in: International journal of modeling, simulation and scientific computing, Vol. 14, No. 5
Main Authors: Guillermo, Luis; Rojas, Jose-Maria; Ugarte, Willy
Format: Journal Article
Language: English
Published: Hackensack: World Scientific Publishing Company (World Scientific Publishing Co. Pte., Ltd), 01.10.2023
Online Access: Get full text
ISSN: 1793-9623
EISSN: 1793-9615
DOI: 10.1142/S1793962324500028

Abstract Visual speech is hard to recreate by hand because animation is a time-consuming task: both precision and detail must be considered and must match the expectations of the developers and, above all, of the audience. To address this problem, several approaches have been designed to help accelerate the animation of characters' faces, such as procedural animation or speech-lip synchronization; the most common research areas for these methods are Computer Vision and Machine Learning. In general, however, these tools suffer from one or more of the following problems: difficulty adapting to another language, subject, or animation software; high hardware requirements; or results that are perceived as robotic. Our work presents a Deep Learning model for automatic expressive facial animation driven by audio. We extract generic audio features from expressive, phoneme-rich speech recordings for language-independent speech processing and emotion recognition. From the training videos, we extract facial landmarks to align frames with speech so that the model learns the animation of phoneme pronunciation. We evaluated four variants of our model (two loss functions, each with and without emotion conditioning) through a user-perception survey; the variant trained with a reconstruction loss and emotion conditioning produced the most natural results and the best synchronization scores, with the approval of the majority of interviewees. For perceived naturalness it obtained 38.89% of the total approval votes, and for language synchronization it obtained the highest average score, 65.55% (98.33 of 150 total points), across English, German, and Korean.
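The abstract outlines a pipeline of audio feature extraction, facial-landmark targets taken from video frames, and a sequence model trained with a reconstruction loss and optional emotion conditioning. The record does not include the authors' implementation, so the following is only a minimal sketch of that kind of pipeline; the MFCC feature count, landmark count, emotion classes, and LSTM architecture are assumptions for illustration, not details from the paper.

```python
import librosa
import torch
import torch.nn as nn

# Hypothetical constants; the paper's actual feature set and landmark count are not given here.
N_MFCC = 26          # generic audio features per frame
N_LANDMARKS = 68     # 2D facial landmarks per video frame
N_EMOTIONS = 6       # size of a one-hot emotion conditioning vector


def audio_features(wav_path: str, fps: float = 25.0) -> torch.Tensor:
    """Extract per-video-frame MFCC features so audio frames align with landmark frames."""
    y, sr = librosa.load(wav_path, sr=16000)
    hop = int(sr / fps)  # one feature vector per video frame
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=N_MFCC, hop_length=hop)
    return torch.from_numpy(mfcc.T).float()  # shape: (frames, N_MFCC)


class AudioToLandmarks(nn.Module):
    """Sequence model mapping audio features (plus an emotion code) to landmark coordinates."""

    def __init__(self):
        super().__init__()
        self.rnn = nn.LSTM(N_MFCC + N_EMOTIONS, 128, batch_first=True)
        self.head = nn.Linear(128, N_LANDMARKS * 2)  # x, y per landmark

    def forward(self, feats: torch.Tensor, emotion: torch.Tensor) -> torch.Tensor:
        # Broadcast the clip-level emotion one-hot over every time step.
        cond = emotion.unsqueeze(1).expand(-1, feats.size(1), -1)
        out, _ = self.rnn(torch.cat([feats, cond], dim=-1))
        return self.head(out)  # shape: (batch, frames, N_LANDMARKS * 2)


# Reconstruction loss: mean squared error between predicted and ground-truth landmarks.
model = AudioToLandmarks()
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

feats = torch.randn(1, 100, N_MFCC)            # stand-in for audio_features(...)
emotion = torch.eye(N_EMOTIONS)[[0]]           # e.g. "neutral"
target = torch.randn(1, 100, N_LANDMARKS * 2)  # landmarks extracted from the video frames

optimizer.zero_grad()
pred = model(feats, emotion)
loss = criterion(pred, target)
loss.backward()
optimizer.step()
```

In practice the landmark targets would come from a face-alignment detector run on the training videos, and the predicted landmark sequences would then drive a 3D face rig; both steps are outside this sketch.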
Copyright: 2023, World Scientific Publishing Company
Discipline: Computer Science
Keywords: audio-visual speech; speech animation; procedural animation
ORCID: 0000-0002-7510-618X
Subject Terms: Animation
Audio data
Computer vision
Deep learning
Emotion recognition
Emotions
Feature extraction
Image reconstruction
Machine learning
Phonemes
Research Article
Software
Speech processing
Speech recognition
Synchronism
Visual tasks
URI: http://www.worldscientific.com/doi/abs/10.1142/S1793962324500028
https://www.proquest.com/docview/3165613273