Memory Attention: Robust Alignment Using Gating Mechanism for End-to-End Speech Synthesis

Bibliographic Details
Published in IEEE Signal Processing Letters, Vol. 27, pp. 2004-2008
Main Authors Lee, Joun Yeop; Cheon, Sung Jun; Choi, Byoung Jin; Kim, Nam Soo
Format Journal Article
Language English
Published New York: IEEE, 2020
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Abstract Recent end-to-end (e2e) speech synthesis systems usually employ attention techniques to align an input text sequence with a mel-spectrogram sequence. The attention-based e2e approach has shown state-of-the-art performance in speech synthesis. However, generating a stable and robust attention alignment that avoids serious failures such as repeated, missing, or mumbled phones remains an ongoing challenge. To mitigate these alignment failures, we propose a novel attention method for e2e speech synthesis called memory attention, which is inspired by the gating mechanism of long short-term memory (LSTM). Leveraging the sequence modeling power of gating techniques, memory attention produces a stable alignment by controlling the amount of content-based and location-based information. For performance evaluation, we compared the proposed memory attention algorithm with various conventional attention techniques in single-speaker and emotional speech synthesis scenarios. From the experimental results, we conclude that memory attention can robustly generate speech in various styles.
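The gating idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual formulation, so everything below is an assumption made for illustration and should not be read as the authors' memory attention: the additive content score, the 1-D convolution over the previous alignment as the location score, the single scalar sigmoid gate, and all names (`gated_attention_step`, `init_params`, `W_q`, `W_k`, `v_c`, `loc_filter`, `w_g`, `b_g`) are hypothetical.

```python
import numpy as np


def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_params(d_q, d_k, d_a, kernel=3, seed=0):
    # Hypothetical random parameters; a trained model would learn these.
    rng = np.random.default_rng(seed)
    return {
        "W_q": rng.normal(size=(d_q, d_a)) * 0.1,
        "W_k": rng.normal(size=(d_k, d_a)) * 0.1,
        "v_c": rng.normal(size=d_a),
        "loc_filter": rng.normal(size=kernel),
        "w_g": rng.normal(size=d_q) * 0.1,
        "b_g": 0.0,
    }


def gated_attention_step(query, keys, prev_align, p):
    """One decoder step of an illustrative gated attention (not the paper's).

    query:      (d_q,)    decoder state
    keys:       (T, d_k)  encoder outputs for the T input tokens
    prev_align: (T,)      attention weights from the previous step
    """
    # Content-based energies: additive attention between query and keys.
    content = np.tanh(keys @ p["W_k"] + query @ p["W_q"]) @ p["v_c"]
    # Location-based energies: filter the previous alignment along time.
    location = np.convolve(prev_align, p["loc_filter"], mode="same")
    # Scalar gate in (0, 1) deciding how much of each source to use.
    gate = sigmoid(query @ p["w_g"] + p["b_g"])
    energies = gate * content + (1.0 - gate) * location
    return softmax(energies), gate


# Usage: six input tokens, previous alignment focused on the first token.
T, d_q, d_k, d_a = 6, 4, 5, 8
p = init_params(d_q, d_k, d_a)
rng = np.random.default_rng(1)
keys = rng.normal(size=(T, d_k))
query = rng.normal(size=d_q)
prev_align = np.zeros(T)
prev_align[0] = 1.0
align, gate = gated_attention_step(query, keys, prev_align, p)
```

In this sketch a gate value near 1 makes the alignment depend mostly on the content match, while a value near 0 makes it follow the previous alignment position; the paper's LSTM-inspired gating may differ substantially from this simplification.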
Author Kim, Nam Soo
Choi, Byoung Jin
Cheon, Sung Jun
Lee, Joun Yeop
Author_xml – sequence: 1
  givenname: Joun Yeop
  orcidid: 0000-0002-3316-4808
  surname: Lee
  fullname: Lee, Joun Yeop
  email: jylee@hi.snu.ac.kr
  organization: Department of Electrical and Computer Engineering and with the Institute of New Media and Communications, Seoul National University, Seoul, South Korea
– sequence: 2
  givenname: Sung Jun
  surname: Cheon
  fullname: Cheon, Sung Jun
  email: sjcheon@hi.snu.ac.kr
  organization: Department of Electrical and Computer Engineering and with the Institute of New Media and Communications, Seoul National University, Seoul, South Korea
– sequence: 3
  givenname: Byoung Jin
  surname: Choi
  fullname: Choi, Byoung Jin
  email: bjchoi@hi.snu.ac.kr
  organization: Department of Electrical and Computer Engineering and with the Institute of New Media and Communications, Seoul National University, Seoul, South Korea
– sequence: 4
  givenname: Nam Soo
  orcidid: 0000-0002-0568-4902
  surname: Kim
  fullname: Kim, Nam Soo
  email: nkim@snu.ac.kr
  organization: Department of Electrical and Computer Engineering and with the Institute of New Media and Communications, Seoul National University, Seoul, South Korea
CODEN ISPLEM
CitedBy_id crossref_primary_10_1109_TAFFC_2022_3175578
crossref_primary_10_1109_TII_2021_3078192
Cites_doi 10.21437/Interspeech.2017-1452
10.1609/aaai.v33i01.33016706
10.1007/978-3-540-49127-9_5
10.1007/978-3-642-24797-2_2
10.1109/ICASSP40776.2020.9054106
10.1109/TASSP.1984.1164317
10.1109/ICASSP40776.2020.9054119
10.1109/ICASSP.2018.8461829
10.1109/ICASSP.2018.8462105
10.1109/ICASSP.2018.8462020
10.18653/v1/D18-1336
10.21437/Interspeech.2018-1616
10.1109/ICASSP.2018.8461368
10.1109/ASRU46091.2019.9003956
10.21437/Interspeech.2019-1972
ContentType Journal Article
Copyright Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2020
Copyright_xml – notice: Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2020
DOI 10.1109/LSP.2020.3036349
DatabaseName IEEE All-Society Periodicals Package (ASPP) 2005-present
IEEE All-Society Periodicals Package (ASPP) 1998-Present
IEEE Xplore
CrossRef
Computer and Information Systems Abstracts
Electronics & Communications Abstracts
Technology Research Database
ProQuest Computer Science Collection
Advanced Technologies Database with Aerospace
Computer and Information Systems Abstracts – Academic
Computer and Information Systems Abstracts Professional
DatabaseTitle CrossRef
Technology Research Database
Computer and Information Systems Abstracts – Academic
Electronics & Communications Abstracts
ProQuest Computer Science Collection
Computer and Information Systems Abstracts
Advanced Technologies Database with Aerospace
Computer and Information Systems Abstracts Professional
DatabaseTitleList
Technology Research Database
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Xplore
  url: https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
Discipline Engineering
EISSN 1558-2361
EndPage 2008
ExternalDocumentID 10_1109_LSP_2020_3036349
9250512
Genre orig-research
GrantInformation_xml – fundername: Korea Government
  grantid: 2020-0-00059
– fundername: Institute of Information & Communications Technology Planning & Evaluation
IEDL.DBID RIE
ISSN 1070-9908
IngestDate Fri Sep 13 02:58:04 EDT 2024
Fri Aug 23 02:31:15 EDT 2024
Wed Jun 26 19:26:40 EDT 2024
IsPeerReviewed true
IsScholarly true
Language English
LinkModel DirectLink
ORCID 0000-0002-0568-4902
0000-0002-3316-4808
PQID 2465445748
PQPubID 75747
PageCount 5
ParticipantIDs proquest_journals_2465445748
crossref_primary_10_1109_LSP_2020_3036349
ieee_primary_9250512
PublicationCentury 2000
PublicationDate 20200000
2020-00-00
20200101
PublicationDateYYYYMMDD 2020-01-01
PublicationDate_xml – year: 2020
  text: 20200000
PublicationDecade 2020
PublicationPlace New York
PublicationPlace_xml – name: New York
PublicationTitle IEEE signal processing letters
PublicationTitleAbbrev LSP
PublicationYear 2020
Publisher IEEE
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Publisher_xml – name: IEEE
– name: The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
SSID ssj0008185
Score 2.3421283
SourceID proquest
crossref
ieee
SourceType Aggregation Database
Publisher
StartPage 2004
SubjectTerms Algorithms
Alignment
Attention mechanism
Computational modeling
Decoding
end-to-end speech synthesis
Logic gates
memory attention
Memory management
Performance evaluation
Robustness
Speech
Speech recognition
Speech synthesis
Training
Title Memory Attention: Robust Alignment Using Gating Mechanism for End-to-End Speech Synthesis
URI https://ieeexplore.ieee.org/document/9250512
https://www.proquest.com/docview/2465445748/abstract/
Volume 27