Memory Attention: Robust Alignment Using Gating Mechanism for End-to-End Speech Synthesis
Recent end-to-end (e2e) speech synthesis systems usually employ attention techniques to align an input text sequence with a mel-spectrogram sequence. The attention-based e2e approach has shown state-of-the-art performance in speech synthesis. However, generating stable and robust attention alignment...
Published in | IEEE signal processing letters Vol. 27; pp. 2004 - 2008 |
---|---|
Main Authors | Lee, Joun Yeop; Cheon, Sung Jun; Choi, Byoung Jin; Kim, Nam Soo |
Format | Journal Article |
Language | English |
Published | New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2020 |
Subjects | |
Abstract | Recent end-to-end (e2e) speech synthesis systems usually employ attention techniques to align an input text sequence with a mel-spectrogram sequence. The attention-based e2e approach has shown state-of-the-art performance in speech synthesis. However, generating a stable and robust attention alignment that avoids serious failures such as repeated, missing, and mumbled phones is still an ongoing challenge. To mitigate these alignment failures, we propose a novel attention method for e2e speech synthesis called memory attention, which is inspired by the gating mechanism of long short-term memory (LSTM). Leveraging the sequence modeling power of gating techniques, memory attention can produce a stable alignment by controlling the amount of content-based and location-based information. For performance evaluation, we compared the proposed memory attention algorithm with various conventional attention techniques in single-speaker and emotional speech synthesis scenarios. From the experimental results, we conclude that memory attention can robustly generate speech in various styles. |
---|---|
Author | Kim, Nam Soo; Choi, Byoung Jin; Cheon, Sung Jun; Lee, Joun Yeop |
Author_xml | – sequence: 1 givenname: Joun Yeop orcidid: 0000-0002-3316-4808 surname: Lee fullname: Lee, Joun Yeop email: jylee@hi.snu.ac.kr organization: Department of Electrical and Computer Engineering and with the Institute of New Media and Communications, Seoul National University, Seoul, South Korea – sequence: 2 givenname: Sung Jun surname: Cheon fullname: Cheon, Sung Jun email: sjcheon@hi.snu.ac.kr organization: Department of Electrical and Computer Engineering and with the Institute of New Media and Communications, Seoul National University, Seoul, South Korea – sequence: 3 givenname: Byoung Jin surname: Choi fullname: Choi, Byoung Jin email: bjchoi@hi.snu.ac.kr organization: Department of Electrical and Computer Engineering and with the Institute of New Media and Communications, Seoul National University, Seoul, South Korea – sequence: 4 givenname: Nam Soo orcidid: 0000-0002-0568-4902 surname: Kim fullname: Kim, Nam Soo email: nkim@snu.ac.kr organization: Department of Electrical and Computer Engineering and with the Institute of New Media and Communications, Seoul National University, Seoul, South Korea |
CODEN | ISPLEM |
CitedBy_id | crossref_primary_10_1109_TAFFC_2022_3175578 crossref_primary_10_1109_TII_2021_3078192 |
ContentType | Journal Article |
Copyright | Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2020 |
Copyright_xml | – notice: Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2020 |
DBID | 97E RIA RIE AAYXX CITATION 7SC 7SP 8FD JQ2 L7M L~C L~D |
DOI | 10.1109/LSP.2020.3036349 |
DatabaseName | IEEE All-Society Periodicals Package (ASPP) 2005-present IEEE All-Society Periodicals Package (ASPP) 1998-Present IEEE Xplore CrossRef Computer and Information Systems Abstracts Electronics & Communications Abstracts Technology Research Database ProQuest Computer Science Collection Advanced Technologies Database with Aerospace Computer and Information Systems Abstracts Academic Computer and Information Systems Abstracts Professional |
DatabaseTitle | CrossRef Technology Research Database Computer and Information Systems Abstracts – Academic Electronics & Communications Abstracts ProQuest Computer Science Collection Computer and Information Systems Abstracts Advanced Technologies Database with Aerospace Computer and Information Systems Abstracts Professional |
DatabaseTitleList | Technology Research Database |
Database_xml | – sequence: 1 dbid: RIE name: IEEE Xplore url: https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/ sourceTypes: Publisher |
DeliveryMethod | fulltext_linktorsrc |
Discipline | Engineering |
EISSN | 1558-2361 |
EndPage | 2008 |
ExternalDocumentID | 10_1109_LSP_2020_3036349 9250512 |
Genre | orig-research |
GrantInformation_xml | – fundername: Korea Government grantid: 2020-0-00059 – fundername: Institute of Information & Communications Technology Planning & Evaluation |
GroupedDBID | -~X .DC 0R~ 0ZS 29I 3EH 4.4 5GY 5VS 6IK 85S 97E AAJGR AASAJ AAYJJ ABFSI ABQJQ ABVLG ACGFO ACGFS ACIWK AENEX AETIX AI. AIBXA AKJIK ALLEH ALMA_UNASSIGNED_HOLDINGS ATWAV BEFXN BFFAM BGNUA BKEBE BPEOZ CS3 DU5 E.L EBS EJD F5P HZ~ H~9 ICLAB IFIPE IFJZH IPLJI JAVBF LAI M43 O9- OCL P2P RIA RIE RIG RNS TAE TN5 VH1 XFK AAYXX CITATION 7SC 7SP 8FD JQ2 L7M L~C L~D |
ID | FETCH-LOGICAL-c291t-fa41e2e4940a61e559e8b3752b9e59b400d4503b40e52bcd9d12e872389d62c73 |
IEDL.DBID | RIE |
ISSN | 1070-9908 |
IngestDate | Fri Sep 13 02:58:04 EDT 2024 Fri Aug 23 02:31:15 EDT 2024 Wed Jun 26 19:26:40 EDT 2024 |
IsPeerReviewed | true |
IsScholarly | true |
Language | English |
LinkModel | DirectLink |
MergedId | FETCHMERGED-LOGICAL-c291t-fa41e2e4940a61e559e8b3752b9e59b400d4503b40e52bcd9d12e872389d62c73 |
ORCID | 0000-0002-0568-4902 0000-0002-3316-4808 |
PQID | 2465445748 |
PQPubID | 75747 |
PageCount | 5 |
ParticipantIDs | proquest_journals_2465445748 crossref_primary_10_1109_LSP_2020_3036349 ieee_primary_9250512 |
PublicationCentury | 2000 |
PublicationDate | 20200000 2020-00-00 20200101 |
PublicationDateYYYYMMDD | 2020-01-01 |
PublicationDate_xml | – year: 2020 text: 20200000 |
PublicationDecade | 2020 |
PublicationPlace | New York |
PublicationPlace_xml | – name: New York |
PublicationTitle | IEEE signal processing letters |
PublicationTitleAbbrev | LSP |
PublicationYear | 2020 |
Publisher | IEEE The Institute of Electrical and Electronics Engineers, Inc. (IEEE) |
Publisher_xml | – name: IEEE – name: The Institute of Electrical and Electronics Engineers, Inc. (IEEE) |
SSID | ssj0008185 |
Score | 2.3421283 |
SourceID | proquest crossref ieee |
SourceType | Aggregation Database Publisher |
StartPage | 2004 |
SubjectTerms | Algorithms; Alignment; Attention mechanism; Computational modeling; Decoding; end-to-end speech synthesis; Logic gates; memory attention; Memory management; Performance evaluation; Robustness; Speech; Speech recognition; Speech synthesis; Training |
Title | Memory Attention: Robust Alignment Using Gating Mechanism for End-to-End Speech Synthesis |
URI | https://ieeexplore.ieee.org/document/9250512 https://www.proquest.com/docview/2465445748/abstract/ |
Volume | 27 |
hasFullText | 1 |
inHoldings | 1 |
isFullTextHit | |
isPrint | |
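
Illustrative note: the abstract says memory attention stabilizes alignment by using a gate to control how much content-based and location-based information is used. This record does not give the paper's equations, so the following is only a minimal sketch of that general idea and not the authors' formulation. The function names (`gated_attention`, `sigmoid`, `softmax`), the scalar `gate_logit`, and the example scores are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Normalize raw scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Squash a real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_attention(content_scores, location_scores, gate_logit):
    """Blend two score vectors with a scalar gate, then normalize.

    Hypothetical illustration only: a gate g in (0, 1) weights the
    content-based scores, and (1 - g) weights the location-based scores.
    """
    if len(content_scores) != len(location_scores):
        raise ValueError("score lists must have the same length")
    g = sigmoid(gate_logit)
    mixed = [g * c + (1.0 - g) * l
             for c, l in zip(content_scores, location_scores)]
    return softmax(mixed)


# Made-up scores over four input positions for one decoder step.
content = [2.0, 0.5, 0.1, 0.0]   # content-based term prefers position 0
location = [0.0, 3.0, 0.5, 0.0]  # location-based term prefers position 1

# A negative gate logit gives g < 0.5, so the location term dominates.
weights = gated_attention(content, location, gate_logit=-2.0)
print(weights)
```

In this sketch, a gate value near 0 lets the location-based scores decide the alignment, while a value near 1 favors the content-based scores. In a trained model the gate would be predicted from the decoder state rather than passed in by hand.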