Pre-Alignment Guided Attention for Improving Training Efficiency and Model Stability in End-to-End Speech Synthesis

Bibliographic Details
Published in: IEEE Access, Vol. 7, pp. 65955-65964
Main Authors: Zhu, Xiaolian; Zhang, Yuchao; Yang, Shan; Xue, Liumeng; Xie, Lei
Format: Journal Article
Language: English
Published: Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2019
Subjects: Acoustics; Alignment; alignment loss; Analytical models; Attention; Coders; Decoding; Efficiency; Encoders-Decoders; Learning; model stability; Neural networks; Phonemes; Speech recognition; Speech synthesis; Stability; Stability analysis; Task analysis; Training; training efficiency

Abstract: Recently, end-to-end (E2E) neural text-to-speech systems, such as Tacotron2, have begun to surpass the traditional multi-stage hand-engineered systems, with both simplified system-building pipelines and high-quality speech. With a unique encoder-decoder neural structure, the Tacotron2 system no longer needs a separately learned text-analysis front-end, duration model, acoustic model, and audio-synthesis module. The key to such a system lies in the attention mechanism, which learns an alignment between the encoder and the decoder, serving as an implicit duration model bridging the text sequence and the acoustic sequence. However, attention learning suffers from low training efficiency and model instability, problems which hinder wide deployment of the E2E approaches. In this paper, we address these problems and propose a novel pre-alignment guided attention learning approach. Specifically, we inject handy prior knowledge (accurate phoneme durations) into the neural network loss function to bias the attention learning in the desired direction more accurately. The explicit time alignment between an audio recording and its corresponding phoneme sequence can be obtained by forced alignment from an automatic speech recognizer (ASR). The experiments show that the proposed pre-alignment guided (PAG) attention approach can significantly improve training efficiency and model stability. More specifically, the PAG-updated version of Tacotron2 can quickly obtain the attention alignment using only 500 (text, audio) pairs, which is not possible for the original Tacotron2. A series of subjective experiments also shows that the PAG-Tacotron2 approach can synthesize more stable and natural speech.
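The mechanism the abstract describes (turning forced-alignment phoneme durations into a target alignment that biases attention through the loss) can be illustrated with a short sketch. The Python/PyTorch snippet below is an assumption-laden illustration, not the paper's actual implementation: the function name prealignment_guided_loss, the Gaussian soft target, the MSE penalty, and the sigma and lambda_align parameters are all hypothetical stand-ins for the loss formulation defined in the paper itself.

    import torch

    def prealignment_guided_loss(attn, durations, sigma=2.0):
        # attn: (T_dec, N_phone) attention weights from the decoder.
        # durations: (N_phone,) phoneme lengths in decoder frames, taken
        # from ASR forced alignment (the "pre-alignment").
        t_dec, _ = attn.shape
        ends = torch.cumsum(durations.float(), dim=0)         # segment end frames
        centers = ends - durations.float() / 2.0              # segment centre frames
        frames = torch.arange(t_dec, dtype=torch.float32).unsqueeze(1)
        # Soft target alignment: a Gaussian around each phoneme's centre frame.
        target = torch.exp(-0.5 * ((frames - centers.unsqueeze(0)) / sigma) ** 2)
        target = target / target.sum(dim=1, keepdim=True)     # row-normalise
        # Penalise deviation of the learned attention from the forced alignment.
        return torch.mean((attn - target) ** 2)

    # Added with a tunable weight to Tacotron2's usual reconstruction loss:
    #   total = mel_loss + lambda_align * prealignment_guided_loss(attn, durations)

Because the target is derived from accurate durations rather than learned from scratch, the attention matrix is pulled toward a plausible monotonic alignment from the first updates, which is consistent with the faster convergence the paper reports on as few as 500 (text, audio) pairs.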
Authors: Zhu, Xiaolian (ORCID: 0000-0002-8842-7329); Zhang, Yuchao; Yang, Shan; Xue, Liumeng; Xie, Lei (email: lxie@nwpu.edu.cn)
Affiliation (all authors): School of Computer Science, Northwestern Polytechnical University, Xi'an, China
CODEN: IAECCG
Cited By (Crossref DOIs): 10.1109/ACCESS.2019.2932750; 10.1109/ACCESS.2022.3175810; 10.1177/1550147720923529; 10.1016/j.csl.2023.101577; 10.55648/1998-6920-2021-15-4-23-31; 10.1016/j.csl.2020.101183; 10.3390/app12031686; 10.3390/e25010041
Copyright: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2019
DOI: 10.1109/ACCESS.2019.2914149
Discipline: Engineering
EISSN: 2169-3536
Genre: Original research
Funding: Natural Science Foundation of Hebei University of Economics and Business (Grant 2016KYQ05); National Basic Research Program of China (973 Program) / National Key Research and Development Program of China (Grant 2017YFB1002102)
ISSN: 2169-3536
Open Access: yes; Peer Reviewed: yes; Scholarly: yes
Open Access Link: https://doaj.org/article/5bb686c363c046ecb5013aa1bb13476a
Page Count: 10
Online Access:
  https://ieeexplore.ieee.org/document/8703406
  https://www.proquest.com/docview/2455614361
  https://doaj.org/article/5bb686c363c046ecb5013aa1bb13476a