Photo-real talking head with deep bidirectional LSTM

Bibliographic Details
Published in	2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4884-4888
Main Authors	Bo Fan, Lijuan Wang, Frank K. Soong, Lei Xie
Format	Conference Proceeding
Language	English
Published	IEEE 01.04.2015
Subjects	AAM; Active appearance model; BLSTM; Face; Hidden Markov models; RNN; Shape; Speech; talking head; Visualization
ISSN	1520-6149
DOI	10.1109/ICASSP.2015.7178899

Abstract	Long short-term memory (LSTM) is a recurrent neural network (RNN) architecture designed to model temporal sequences and their long-range dependencies more accurately than conventional RNNs. In this paper, we propose using a deep bidirectional LSTM (BLSTM) for audio/visual modeling in our photo-real talking head system. An audio/visual database of a subject talking is first recorded as training data. The audio/visual stereo data are converted into two parallel temporal sequences: contextual label sequences, obtained by force-aligning the audio against the text, and visual feature sequences, obtained by applying an active appearance model (AAM) to the lower face region of all training image samples. The deep BLSTM is then trained as a regression model by minimizing the sum of squared errors (SSE) in predicting the visual sequence from the label sequence. After testing different network topologies, we found that the best network on our datasets consists of two BLSTM layers sitting on top of one feed-forward layer. Compared with our previous HMM-based system, the newly proposed deep BLSTM-based system performs better in both objective measurements and a subjective A/B test.
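The topology named in the abstract (one feed-forward layer, then two BLSTM layers, scored by SSE) can be illustrated with a minimal forward-pass sketch. This is not the authors' implementation: the layer sizes, the peephole-free gate equations, the linear output layer, the random weights, and all function names are assumptions made for illustration, and training (backpropagation through time) is omitted.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_matrix(rows, cols, rng):
    # Small random weights; a trained system would learn these instead.
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def dense(w, xs, act):
    # Frame-wise layer; the appended 1.0 serves as a bias input.
    return [[act(a) for a in matvec(w, x + [1.0])] for x in xs]


class LSTM:
    """One-direction LSTM layer without peephole connections (an assumption)."""

    def __init__(self, n_in, n_hid, rng):
        self.n_hid = n_hid
        # One weight matrix per gate (i, f, o) and for the cell candidate (c).
        self.w = {g: make_matrix(n_hid, n_in + n_hid + 1, rng) for g in "ifoc"}

    def run(self, xs):
        h, c, out = [0.0] * self.n_hid, [0.0] * self.n_hid, []
        for x in xs:
            z = x + h + [1.0]  # current input, previous hidden state, bias
            i = [sigmoid(a) for a in matvec(self.w["i"], z)]
            f = [sigmoid(a) for a in matvec(self.w["f"], z)]
            o = [sigmoid(a) for a in matvec(self.w["o"], z)]
            g = [math.tanh(a) for a in matvec(self.w["c"], z)]
            c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
            h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
            out.append(h)
        return out


def blstm_layer(fwd, bwd, xs):
    # Run one LSTM left-to-right and one right-to-left, then concatenate
    # the two hidden states frame by frame.
    hf = fwd.run(xs)
    hb = bwd.run(xs[::-1])[::-1]
    return [a + b for a, b in zip(hf, hb)]


def build(n_label, n_ff, n_hid, n_visual, rng):
    return {
        "ff": make_matrix(n_ff, n_label + 1, rng),
        "blstm1": (LSTM(n_ff, n_hid, rng), LSTM(n_ff, n_hid, rng)),
        "blstm2": (LSTM(2 * n_hid, n_hid, rng), LSTM(2 * n_hid, n_hid, rng)),
        "out": make_matrix(n_visual, 2 * n_hid + 1, rng),
    }


def forward(params, labels):
    h = dense(params["ff"], labels, math.tanh)      # feed-forward layer
    h = blstm_layer(*params["blstm1"], h)           # first BLSTM layer
    h = blstm_layer(*params["blstm2"], h)           # second BLSTM layer
    return dense(params["out"], h, lambda a: a)     # linear output (assumed)


def sse(pred, target):
    # Sum of squared errors over all frames and feature dimensions.
    return sum((p - t) ** 2 for pf, tf in zip(pred, target) for p, t in zip(pf, tf))


# Toy data: 4 frames of 6-dim label features and 3-dim visual (AAM-like) targets.
rng = random.Random(0)
params = build(n_label=6, n_ff=5, n_hid=4, n_visual=3, rng=rng)
labels = [[rng.random() for _ in range(6)] for _ in range(4)]
target = [[rng.random() for _ in range(3)] for _ in range(4)]
pred = forward(params, labels)
loss = sse(pred, target)
```

Here `pred` holds one predicted visual feature vector per input frame, and `loss` is the quantity that training would minimize. Because each BLSTM layer reads the sequence in both directions, the prediction for the first frame also depends on the last frame's labels.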
Author	Bo Fan; Lijuan Wang; Frank K. Soong; Lei Xie
Author_xml	1. Bo Fan (bofan@nwpu-aslp.org), Sch. of Comput. Sci., Northwestern Polytech. Univ., Xi'an, China; 2. Lijuan Wang (lijuanw@microsoft.com), Microsoft Res. Asia, Beijing, China; 3. Frank K. Soong (frankkps@microsoft.com), Microsoft Res. Asia, Beijing, China; 4. Lei Xie (lxie@nwpu-aslp.org), Sch. of Comput. Sci., Northwestern Polytech. Univ., Xi'an, China
Discipline	Engineering
EISBN	1467369977 9781467369978
EndPage	4888
Genre	orig-research
IsPeerReviewed	false
IsScholarly	true
Language	English
PageCount	5
StartPage	4884
URI	https://ieeexplore.ieee.org/document/7178899