Pre-Alignment Guided Attention for Improving Training Efficiency and Model Stability in End-to-End Speech Synthesis
Published in | IEEE Access, Vol. 7, pp. 65955–65964 |
---|---|
Main Authors | Zhu, Xiaolian; Zhang, Yuchao; Yang, Shan; Xue, Liumeng; Xie, Lei |
Format | Journal Article |
Language | English |
Published | Piscataway: IEEE (The Institute of Electrical and Electronics Engineers, Inc.), 2019 |
Subjects | Speech synthesis; Attention; Alignment; Alignment loss; Neural networks; Phonemes; Training efficiency; Model stability |
Abstract | Recently, end-to-end (E2E) neural text-to-speech systems, such as Tacotron2, have begun to surpass traditional multi-stage hand-engineered systems, offering both simplified system building pipelines and high-quality speech. With a unique encoder-decoder neural structure, the Tacotron2 system no longer needs a separately learned text analysis front-end, duration model, acoustic model, or audio synthesis module. The key to such a system lies in the attention mechanism, which learns an alignment between the encoder and the decoder, serving as an implicit duration model bridging the text sequence and the acoustic sequence. However, attention learning suffers from low training efficiency and model instability, which hinder the E2E approaches from wide deployment. In this paper, we address these problems and propose a novel pre-alignment guided attention learning approach. Specifically, we inject handy prior knowledge (accurate phoneme durations) into the neural network loss function to bias the attention learning in the desired direction more accurately. The explicit time alignment between an audio recording and its corresponding phoneme sequence can be obtained by forced alignment from an automatic speech recognizer (ASR). The experiments show that the proposed pre-alignment guided (PAG) attention approach can significantly improve training efficiency and model stability. More specifically, the PAG updated version of Tacotron2 can quickly obtain the attention alignment using only 500 (text, audio) pairs, which is apparently not possible for the original Tacotron2. A series of subjective experiments also shows that the PAG-Tacotron2 approach can synthesize more stable and natural speech. |
---|---|
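The idea in the abstract, using forced-alignment phoneme durations as prior knowledge to guide attention, can be sketched in a few lines of NumPy. The exact loss formulation is not given in this record, so `guided_attention_loss` below is an illustrative guess (penalizing attention mass that falls off the forced-alignment path), not the paper's actual objective; the function and variable names are hypothetical.

```python
import numpy as np

def duration_to_alignment(durations, num_frames):
    """Build a hard (0/1) frame-to-phoneme alignment matrix from
    forced-alignment durations (number of frames per phoneme).
    Rows index decoder frames, columns index phonemes."""
    align = np.zeros((num_frames, len(durations)))
    frame = 0
    for j, d in enumerate(durations):
        align[frame:frame + d, j] = 1.0
        frame += d
    return align

def guided_attention_loss(attention, durations):
    """Hypothetical pre-alignment guided loss: for each decoder frame,
    take the attention mass placed on the phoneme that the forced
    alignment assigns to that frame, and push it toward 1 with -log."""
    target = duration_to_alignment(durations, attention.shape[0])
    on_path = (attention * target).sum(axis=1)  # mass on the aligned phoneme
    return float(-np.log(on_path + 1e-8).mean())
```

An attention matrix that follows the forced alignment exactly yields a loss near zero, while diffuse attention is penalized, which is the mechanism by which such a term would bias attention learning toward the prior alignment during early training.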
Author | Xie, Lei Zhang, Yuchao Zhu, Xiaolian Xue, Liumeng Yang, Shan |
Affiliations | Xiaolian Zhu (ORCID: 0000-0002-8842-7329), Yuchao Zhang, Shan Yang, Liumeng Xue, and Lei Xie (lxie@nwpu.edu.cn), all with the School of Computer Science, Northwestern Polytechnical University, Xi'an, China |
CODEN | IAECCG |
Copyright | Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2019 |
DOI | 10.1109/ACCESS.2019.2914149 |
Discipline | Engineering |
EISSN | 2169-3536 |
EndPage | 65964 |
Genre | orig-research |
GrantInformation | Natural Science Foundation of Hebei University of Economics and Business (grant 2016KYQ05); National Basic Research Program of China (973 Program) and National Key Research and Development Program of China (grant 2017YFB1002102) |
ISSN | 2169-3536 |
IsDoiOpenAccess | true |
IsOpenAccess | true |
IsPeerReviewed | true |
IsScholarly | true |
ORCID | 0000-0002-8842-7329 |
OpenAccessLink | https://doaj.org/article/5bb686c363c046ecb5013aa1bb13476a |
PageCount | 10 |
PublicationDate | 2019-01-01 |
PublicationPlace | Piscataway |
PublicationTitle | IEEE access |
PublicationTitleAbbrev | Access |
PublicationYear | 2019 |
Publisher | IEEE The Institute of Electrical and Electronics Engineers, Inc. (IEEE) |
StartPage | 65955 |
SubjectTerms | Acoustics Alignment alignment loss Analytical models Attention Coders Decoding Efficiency Encoders-Decoders Learning model stability Neural networks Phonemes Speech recognition Speech synthesis Stability Stability analysis Task analysis Training training efficiency |
URI | https://ieeexplore.ieee.org/document/8703406 https://www.proquest.com/docview/2455614361 https://doaj.org/article/5bb686c363c046ecb5013aa1bb13476a |
Volume | 7 |