Incremental Text-to-Speech Synthesis Using Pseudo Lookahead With Large Pretrained Language Model

Bibliographic Details
Published in	IEEE Signal Processing Letters, Vol. 28, pp. 857-861
Main Authors	Saeki, Takaaki; Takamichi, Shinnosuke; Saruwatari, Hiroshi
Format	Journal Article
Language	English
Published	New York: IEEE (The Institute of Electrical and Electronics Engineers, Inc.), 2021
Subjects	Context modeling; contextual embedding; Decoding; end-to-end text-to-speech synthesis; Incremental text-to-speech synthesis; language model; Linguistics; Predictive models; Speech; Speech recognition; Speech synthesis; Training; Tuning
ISSN	1070-9908 (print); 1558-2361 (electronic)
DOI	10.1109/LSP.2021.3073869

Abstract	This letter presents an incremental text-to-speech (TTS) method that performs synthesis in small linguistic units while maintaining the naturalness of output speech. Incremental TTS is generally subject to a trade-off between latency and synthetic speech quality. It is challenging to produce high-quality speech with a low-latency setup that does not make much use of an unobserved future sentence (hereafter, "lookahead"). To resolve this issue, we propose an incremental TTS method that uses a pseudo lookahead generated with a language model to take the future contextual information into account without increasing latency. Our method can be regarded as imitating a human's incremental reading and uses pretrained GPT2, which accounts for the large-scale linguistic knowledge, for the lookahead generation. Evaluation results show that our method 1) achieves higher speech quality than the method taking only observed information into account and 2) achieves a speech quality equivalent to waiting for the future context observation.
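The pseudo-lookahead idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: `toy_language_model` is a hypothetical stand-in for the pretrained GPT2 mentioned in the abstract, `toy_synthesize` is a hypothetical placeholder for a TTS model, and the segment and lookahead lengths are arbitrary assumptions. The sketch shows only the control flow: each segment is synthesized as soon as it is observed, using a predicted continuation in place of the unobserved future text.

```python
from typing import Callable, Dict, List

Words = List[str]


def toy_language_model(context: Words, n_words: int) -> Words:
    """Hypothetical stand-in for a pretrained LM: predict n_words next words."""
    # A tiny fixed bigram table; a real system would sample from the LM.
    bigram = {"the": "cat", "cat": "sat", "sat": "on", "on": "the"}
    prev = context[-1] if context else "the"
    predicted: Words = []
    for _ in range(n_words):
        prev = bigram.get(prev, "the")
        predicted.append(prev)
    return predicted


def toy_synthesize(past: Words, current: Words, lookahead: Words) -> Dict[str, Words]:
    """Hypothetical TTS placeholder: records its inputs instead of producing audio."""
    return {"past": list(past), "segment": list(current), "lookahead": list(lookahead)}


def incremental_tts(
    words: Words,
    segment_len: int,
    lookahead_len: int,
    lm: Callable[[Words, int], Words],
    synth: Callable[[Words, Words, Words], Dict[str, Words]],
) -> List[Dict[str, Words]]:
    """Synthesize segment by segment without waiting for future words."""
    observed: Words = []
    outputs: List[Dict[str, Words]] = []
    for start in range(0, len(words), segment_len):
        current = words[start:start + segment_len]
        past = list(observed)          # text already synthesized
        observed.extend(current)       # text observed so far
        # The pseudo lookahead is generated from observed text only,
        # so no extra latency is spent waiting for the real future context.
        pseudo_lookahead = lm(observed, lookahead_len)
        outputs.append(synth(past, current, pseudo_lookahead))
    return outputs
```

Passing the language model and the synthesizer as callables keeps the loop independent of any particular model, so a real LM or TTS back end could be substituted under the same assumptions.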
Author details	1. Saeki, Takaaki (ORCID: 0000-0001-6003-768X; email: takaaki_saeki@ipc.i.u-tokyo.ac.jp), Graduate School of Information Science and Technology, University of Tokyo, Tokyo, Japan
	2. Takamichi, Shinnosuke (ORCID: 0000-0003-0520-7847; email: shinnosuke_takamichi@ipc.i.u-tokyo.ac.jp), Graduate School of Information Science and Technology, University of Tokyo, Tokyo, Japan
	3. Saruwatari, Hiroshi (ORCID: 0000-0003-0876-5617; email: hiroshi_saruwatari@ipc.i.u-tokyo.ac.jp), Graduate School of Information Science and Technology, University of Tokyo, Tokyo, Japan
CODEN	ISPLEM
Copyright	Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2021
Discipline	Engineering
EISSN	1558-2361
Genre	orig-research
Funding	JSPS KAKENHI; grant IDs: 17H06101; 19H01116; MIC/SCOPE #182103104
IsDoiOpenAccess	true
IsOpenAccess	true
IsPeerReviewed	true
IsScholarly	true
License	https://creativecommons.org/licenses/by/4.0/legalcode
URI	https://ieeexplore.ieee.org/document/9406329 https://www.proquest.com/docview/2525839126