Photo-real talking head with deep bidirectional LSTM
Published in | 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) pp. 4884 - 4888 |
---|---|
Main Authors | Bo Fan, Lijuan Wang, Frank K. Soong, Lei Xie |
Format | Conference Proceeding |
Language | English |
Published | IEEE 01.04.2015 |
ISSN | 1520-6149 |
DOI | 10.1109/ICASSP.2015.7178899 |
Abstract | Long short-term memory (LSTM) is a recurrent neural network (RNN) architecture designed to model temporal sequences and their long-range dependencies more accurately than conventional RNNs. In this paper, we propose using a deep bidirectional LSTM (BLSTM) for audio/visual modeling in our photo-real talking head system. An audio/visual database of a subject talking is first recorded as training data. The audio/visual stereo data are converted into two parallel temporal sequences: contextual label sequences, obtained by force-aligning the audio against the text, and visual feature sequences, obtained by applying an active appearance model (AAM) to the lower face region of all training image samples. The deep BLSTM is then trained as a regression model by minimizing the sum of squared errors (SSE) in predicting the visual sequence from the label sequence. After testing different network topologies, we found that the best network on our datasets is two BLSTM layers sitting on top of one feed-forward layer. Compared with our previous HMM-based system, the newly proposed deep BLSTM-based system performs better in both objective measurement and a subjective A/B test. |
---|---|
Author | Bo Fan; Lijuan Wang; Frank K. Soong; Lei Xie |
Author_xml | 1. Bo Fan (bofan@nwpu-aslp.org), Sch. of Comput. Sci., Northwestern Polytech. Univ., Xi'an, China; 2. Lijuan Wang (lijuanw@microsoft.com), Microsoft Res. Asia, Beijing, China; 3. Frank K. Soong (frankkps@microsoft.com), Microsoft Res. Asia, Beijing, China; 4. Lei Xie (lxie@nwpu-aslp.org), Sch. of Comput. Sci., Northwestern Polytech. Univ., Xi'an, China |
ContentType | Conference Proceeding |
Discipline | Engineering |
EISBN | 1467369977 9781467369978 |
EndPage | 4888 |
ExternalDocumentID | 7178899 |
Genre | orig-research |
IsPeerReviewed | false |
IsScholarly | true |
PageCount | 5 |
PublicationDate | 2015-04-01 |
PublicationTitle | 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) |
PublicationTitleAbbrev | ICASSP |
PublicationYear | 2015 |
Publisher | IEEE |
StartPage | 4884 |
SubjectTerms | AAM; Active appearance model; BLSTM; Face; Hidden Markov models; RNN; Shape; Speech; talking head; Visualization |
Title | Photo-real talking head with deep bidirectional LSTM |
URI | https://ieeexplore.ieee.org/document/7178899 |
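
The topology reported in the abstract (two BLSTM layers on top of one feed-forward layer, trained against a sum-of-squared-errors objective) can be sketched as a forward pass in plain Python. This is only an illustrative sketch, not the authors' implementation: the layer sizes, the tanh activation on the feed-forward layer, the linear output layer, the random initialization, and all names (`TalkingHeadNet`, `predict`, `sse`) are assumptions made for the example, and training by backpropagation through time is omitted.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def rand_matrix(rows, cols, rng, scale=0.5):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def affine(W, b, x):
    """Return W @ x + b for list-based matrices and vectors."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


class LSTM:
    """One unidirectional LSTM layer (forward pass only)."""

    def __init__(self, n_in, n_hid, rng):
        self.n_hid = n_hid
        # One weight matrix per gate, applied to the concatenation [x_t, h_{t-1}].
        self.W = {g: rand_matrix(n_hid, n_in + n_hid, rng) for g in "ifog"}
        self.b = {g: [0.0] * n_hid for g in "ifog"}

    def run(self, xs):
        h, c, outputs = [0.0] * self.n_hid, [0.0] * self.n_hid, []
        for x in xs:
            z = x + h  # list concatenation of input frame and previous state
            i = [sigmoid(v) for v in affine(self.W["i"], self.b["i"], z)]
            f = [sigmoid(v) for v in affine(self.W["f"], self.b["f"], z)]
            o = [sigmoid(v) for v in affine(self.W["o"], self.b["o"], z)]
            g = [math.tanh(v) for v in affine(self.W["g"], self.b["g"], z)]
            c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
            h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
            outputs.append(h)
        return outputs


class BLSTM:
    """Bidirectional layer: concatenates a forward and a backward LSTM pass."""

    def __init__(self, n_in, n_hid, rng):
        self.fwd, self.bwd = LSTM(n_in, n_hid, rng), LSTM(n_in, n_hid, rng)

    def run(self, xs):
        forward = self.fwd.run(xs)
        backward = self.bwd.run(xs[::-1])[::-1]  # re-align to original time order
        return [hf + hb for hf, hb in zip(forward, backward)]


class TalkingHeadNet:
    """Hypothetical stack: feed-forward -> BLSTM -> BLSTM -> linear output."""

    def __init__(self, n_label, n_ff, n_hid, n_visual, rng):
        self.ff_W, self.ff_b = rand_matrix(n_ff, n_label, rng), [0.0] * n_ff
        self.blstm1 = BLSTM(n_ff, n_hid, rng)
        self.blstm2 = BLSTM(2 * n_hid, n_hid, rng)
        self.out_W, self.out_b = rand_matrix(n_visual, 2 * n_hid, rng), [0.0] * n_visual

    def predict(self, labels):
        hidden = [[math.tanh(v) for v in affine(self.ff_W, self.ff_b, x)] for x in labels]
        hidden = self.blstm2.run(self.blstm1.run(hidden))
        return [affine(self.out_W, self.out_b, h) for h in hidden]


def sse(pred, target):
    """Sum of squared errors over all frames and feature dimensions."""
    return sum((p - t) ** 2 for pf, tf in zip(pred, target) for p, t in zip(pf, tf))


# Toy data standing in for label frames and AAM visual features (made up).
rng = random.Random(0)
T, n_label, n_visual = 4, 8, 3
net = TalkingHeadNet(n_label, n_ff=6, n_hid=5, n_visual=n_visual, rng=rng)
labels = [[rng.random() for _ in range(n_label)] for _ in range(T)]
visual = [[rng.random() for _ in range(n_visual)] for _ in range(T)]
pred = net.predict(labels)
loss = sse(pred, visual)
```

Because each BLSTM layer also reads the sequence in reverse, a change to the last label frame affects the prediction for the first frame, which is the kind of two-sided context a unidirectional model cannot provide.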