Photo-real talking head with deep bidirectional LSTM

Bibliographic Details
Published in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4884-4888
Main Authors Bo Fan, Lijuan Wang, Frank K. Soong, Lei Xie
Format Conference Proceeding
Language English
Published IEEE 01.04.2015
ISSN 1520-6149
DOI 10.1109/ICASSP.2015.7178899

Abstract Long short-term memory (LSTM) is a recurrent neural network (RNN) architecture designed to model temporal sequences and their long-range dependencies more accurately than conventional RNNs. In this paper, we propose using a deep bidirectional LSTM (BLSTM) for audio/visual modeling in our photo-real talking head system. An audio/visual database of a subject talking is first recorded as training data. The audio/visual stereo data are converted into two parallel temporal sequences: contextual label sequences, obtained by force-aligning the audio against the text, and visual feature sequences, obtained by applying an active appearance model (AAM) to the lower face region of all training image samples. The deep BLSTM is then trained as a regression model by minimizing the sum of squared errors (SSE) in predicting the visual sequence from the label sequence. After testing different network topologies, we found that the best network on our datasets consists of two BLSTM layers sitting on top of one feed-forward layer. Compared with our previous HMM-based system, the newly proposed deep BLSTM-based system performs better in both objective measurements and a subjective A/B test.
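The training criterion named in the abstract, the sum of squared errors between predicted and target visual feature sequences, can be illustrated with a minimal sketch. This is not the authors' code: the function name `sum_of_squared_errors`, the list-of-frames layout, and the toy values are assumptions made only for illustration, and the BLSTM that would produce the predictions is not modeled here.

```python
from typing import Sequence

# One frame is a vector of visual (e.g. AAM) parameters; a sequence is a list of frames.
Frame = Sequence[float]


def sum_of_squared_errors(predicted: Sequence[Frame], target: Sequence[Frame]) -> float:
    """Sum the squared differences over every frame and every feature dimension."""
    if len(predicted) != len(target):
        raise ValueError("sequences must contain the same number of frames")
    total = 0.0
    for pred_frame, target_frame in zip(predicted, target):
        if len(pred_frame) != len(target_frame):
            raise ValueError("frames must have the same dimensionality")
        for p, t in zip(pred_frame, target_frame):
            diff = p - t
            total += diff * diff
    return total


# Hypothetical two-frame, two-dimensional example (values are made up).
predicted_seq = [[1.0, 2.0], [0.0, 0.5]]
target_seq = [[1.0, 0.0], [1.0, 0.5]]
print(sum_of_squared_errors(predicted_seq, target_seq))  # prints 5.0
```

In a real system this quantity would be computed over network outputs and minimized by gradient-based training; the sketch only shows what is being summed.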
Author_xml – sequence: 1
  surname: Bo Fan
  fullname: Bo Fan
  email: bofan@nwpu-aslp.org
  organization: Sch. of Comput. Sci., Northwestern Polytech. Univ., Xi'an, China
– sequence: 2
  surname: Lijuan Wang
  fullname: Lijuan Wang
  email: lijuanw@microsoft.com
  organization: Microsoft Res. Asia, Beijing, China
– sequence: 3
  givenname: Frank K.
  surname: Soong
  fullname: Soong, Frank K.
  email: frankkps@microsoft.com
  organization: Microsoft Res. Asia, Beijing, China
– sequence: 4
  surname: Lei Xie
  fullname: Lei Xie
  email: lxie@nwpu-aslp.org
  organization: Sch. of Comput. Sci., Northwestern Polytech. Univ., Xi'an, China
ContentType Conference Proceeding
DOI 10.1109/ICASSP.2015.7178899
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan (POP) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE/IET Electronic Library
IEEE Proceedings Order Plans (POP) 1998-present
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
Discipline Engineering
EISBN 1467369977
9781467369978
EndPage 4888
ExternalDocumentID 7178899
Genre orig-research
IEDL.DBID RIE
ISSN 1520-6149
IngestDate Wed Aug 27 01:41:45 EDT 2025
IsPeerReviewed false
IsScholarly true
Language English
LinkModel DirectLink
PageCount 5
ParticipantIDs ieee_primary_7178899
PublicationCentury 2000
PublicationDate 20150401
PublicationDateYYYYMMDD 2015-04-01
PublicationDate_xml – month: 04
  year: 2015
  text: 20150401
  day: 01
PublicationDecade 2010
PublicationTitle 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
PublicationTitleAbbrev ICASSP
PublicationYear 2015
Publisher IEEE
Publisher_xml – name: IEEE
SSID ssj0001767474
ssj0008748
Score 2.2512226
SourceID ieee
SourceType Publisher
StartPage 4884
SubjectTerms AAM
Active appearance model
BLSTM
Face
Hidden Markov models
RNN
Shape
Speech
talking head
Visualization
Title Photo-real talking head with deep bidirectional LSTM
URI https://ieeexplore.ieee.org/document/7178899