Pre-Alignment Guided Attention for Improving Training Efficiency and Model Stability in End-to-End Speech Synthesis

Bibliographic Details
Published in: IEEE Access, Vol. 7, pp. 65955-65964
Main Authors: Zhu, Xiaolian; Zhang, Yuchao; Yang, Shan; Xue, Liumeng; Xie, Lei
Format: Journal Article
Language: English
Published: Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2019
Subjects: Acoustics; Alignment; alignment loss; Analytical models; Attention; Coders; Decoding; Efficiency; Encoders-Decoders; Learning; model stability; Neural networks; Phonemes; Speech recognition; Speech synthesis; Stability; Stability analysis; Task analysis; Training; training efficiency

Abstract: Recently, end-to-end (E2E) neural text-to-speech systems, such as Tacotron2, have begun to surpass the traditional multi-stage hand-engineered systems, with both simplified system-building pipelines and high-quality speech. With a unique encoder-decoder neural structure, the Tacotron2 system no longer needs a separately learned text-analysis front-end, duration model, acoustic model, and audio-synthesis module. The key to such a system lies in the attention mechanism, which learns an alignment between the encoder and the decoder, serving as an implicit duration model bridging the text sequence and the acoustic sequence. However, attention learning suffers from low training efficiency and model instability, problems which hinder wide deployment of the E2E approaches. In this paper, we address these problems and propose a novel pre-alignment guided attention learning approach. Specifically, we inject handy prior knowledge (accurate phoneme durations) into the neural network loss function to bias the attention learning in the desired direction more accurately. The explicit time alignment between an audio recording and its corresponding phoneme sequence can be obtained by forced alignment from an automatic speech recognizer (ASR). The experiments show that the proposed pre-alignment guided (PAG) attention approach can significantly improve training efficiency and model stability. More specifically, the PAG-updated version of Tacotron2 can quickly obtain the attention alignment using only 500 (text, audio) pairs, which is not possible for the original Tacotron2. A series of subjective experiments also shows that the PAG-Tacotron2 approach can synthesize more stable and natural speech.
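The mechanism the abstract describes (turning forced-alignment phoneme durations into a target alignment that biases attention through the loss) can be illustrated with a short sketch. The Python/PyTorch snippet below is an assumption-laden illustration, not the paper's actual implementation: the function name prealignment_guided_loss, the Gaussian soft target, the MSE penalty, and the sigma and lambda_align parameters are all hypothetical stand-ins for the loss formulation defined in the paper itself.

    import torch

    def prealignment_guided_loss(attn, durations, sigma=2.0):
        # attn: (T_dec, N_phone) attention weights from the decoder.
        # durations: (N_phone,) phoneme lengths in decoder frames, taken
        # from ASR forced alignment (the "pre-alignment").
        t_dec, _ = attn.shape
        ends = torch.cumsum(durations.float(), dim=0)         # segment end frames
        centers = ends - durations.float() / 2.0              # segment centre frames
        frames = torch.arange(t_dec, dtype=torch.float32).unsqueeze(1)
        # Soft target alignment: a Gaussian around each phoneme's centre frame.
        target = torch.exp(-0.5 * ((frames - centers.unsqueeze(0)) / sigma) ** 2)
        target = target / target.sum(dim=1, keepdim=True)     # row-normalise
        # Penalise deviation of the learned attention from the forced alignment.
        return torch.mean((attn - target) ** 2)

    # Added with a tunable weight to Tacotron2's usual reconstruction loss:
    #   total = mel_loss + lambda_align * prealignment_guided_loss(attn, durations)

Because the target is derived from accurate durations rather than learned from scratch, the attention matrix is pulled toward a plausible monotonic alignment from the first updates, which is consistent with the faster convergence the paper reports on as few as 500 (text, audio) pairs.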
Authors: Zhu, Xiaolian (ORCID: 0000-0002-8842-7329); Zhang, Yuchao; Yang, Shan; Xue, Liumeng; Xie, Lei (email: lxie@nwpu.edu.cn)
Affiliation (all authors): School of Computer Science, Northwestern Polytechnical University, Xi'an, China
CODEN: IAECCG
Cited By (Crossref DOIs): 10.1109/ACCESS.2019.2932750; 10.1109/ACCESS.2022.3175810; 10.1177/1550147720923529; 10.1016/j.csl.2023.101577; 10.55648/1998-6920-2021-15-4-23-31; 10.1016/j.csl.2020.101183; 10.3390/app12031686; 10.3390/e25010041
Copyright: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2019
DOI: 10.1109/ACCESS.2019.2914149
Discipline: Engineering
EISSN: 2169-3536
Genre: Original research
Funding: Natural Science Foundation of Hebei University of Economics and Business (Grant 2016KYQ05); National Basic Research Program of China (973 Program) / National Key Research and Development Program of China (Grant 2017YFB1002102)
ISSN: 2169-3536
Open Access: yes; Peer Reviewed: yes; Scholarly: yes
Open Access Link: https://doaj.org/article/5bb686c363c046ecb5013aa1bb13476a
Page Count: 10
Online Access:
  https://ieeexplore.ieee.org/document/8703406
  https://www.proquest.com/docview/2455614361
  https://doaj.org/article/5bb686c363c046ecb5013aa1bb13476a