Incremental Text-to-Speech Synthesis Using Pseudo Lookahead With Large Pretrained Language Model

Bibliographic Details
Published in IEEE Signal Processing Letters, Vol. 28, pp. 857-861
Main Authors Saeki, Takaaki; Takamichi, Shinnosuke; Saruwatari, Hiroshi
Format Journal Article
Language English
Published New York IEEE 2021
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
ISSN 1070-9908 (print), 1558-2361 (electronic)
DOI 10.1109/LSP.2021.3073869

Abstract This letter presents an incremental text-to-speech (TTS) method that performs synthesis in small linguistic units while maintaining the naturalness of the output speech. Incremental TTS is generally subject to a trade-off between latency and synthetic speech quality. It is challenging to produce high-quality speech with a low-latency setup that makes little use of the unobserved future sentence (hereafter, "lookahead"). To resolve this issue, we propose an incremental TTS method that uses a pseudo lookahead generated with a language model to take future contextual information into account without increasing latency. Our method can be regarded as imitating a human's incremental reading, and it uses pretrained GPT2, which captures large-scale linguistic knowledge, for the lookahead generation. Evaluation results show that our method 1) achieves higher speech quality than a method that takes only observed information into account and 2) achieves speech quality equivalent to that obtained by waiting for the future context to be observed.
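The control flow described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: a toy bigram table stands in for the pretrained GPT2, and `fake_synthesize` is a hypothetical stub in place of a real TTS model. All function names and parameters here (`train_bigram`, `pseudo_lookahead`, `incremental_tts`, `n_lookahead`) are assumptions made for illustration; the record does not describe the paper's model architecture or conditioning mechanism.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List


def train_bigram(corpus: List[str]) -> Dict[str, Counter]:
    """Count word bigrams. A toy stand-in for a pretrained language model."""
    table: Dict[str, Counter] = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            table[prev][nxt] += 1
    return table


def pseudo_lookahead(table: Dict[str, Counter], observed: List[str],
                     n_words: int) -> List[str]:
    """Greedily predict up to n_words future words from the observed prefix."""
    predicted: List[str] = []
    prev = observed[-1].lower()
    for _ in range(n_words):
        followers = table.get(prev)  # .get avoids creating new keys
        if not followers:
            break  # the toy model has no continuation for this word
        prev = followers.most_common(1)[0][0]
        predicted.append(prev)
    return predicted


def incremental_tts(words: List[str], table: Dict[str, Counter],
                    synthesize: Callable[[List[str], str, List[str]], str],
                    n_lookahead: int = 2) -> List[str]:
    """Synthesize one word at a time without waiting for real future words.

    Each segment is conditioned on the observed past, the current word, and
    a predicted (pseudo) future, so no extra input latency is introduced.
    """
    segments: List[str] = []
    for i, word in enumerate(words):
        observed = words[: i + 1]
        future = pseudo_lookahead(table, observed, n_lookahead)
        segments.append(synthesize(observed[:-1], word, future))
    return segments


def fake_synthesize(past: List[str], current: str, future: List[str]) -> str:
    """Hypothetical stub: returns a tag describing the conditioning context."""
    return f"<{current}|past={len(past)}|future={' '.join(future)}>"


if __name__ == "__main__":
    lm = train_bigram(["the cat sat on the mat", "the cat sat on a chair"])
    for segment in incremental_tts(["the", "cat"], lm, fake_synthesize, 1):
        print(segment)
```

In a real system the bigram table would be replaced by sampling continuations from a pretrained language model, and the stub by an acoustic model that encodes the past and pseudo-future context; both substitutions are outside what this record specifies.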
Authors
1. Saeki, Takaaki (ORCID 0000-0001-6003-768X), takaaki_saeki@ipc.i.u-tokyo.ac.jp, Graduate School of Information Science and Technology, University of Tokyo, Tokyo, Japan
2. Takamichi, Shinnosuke (ORCID 0000-0003-0520-7847), shinnosuke_takamichi@ipc.i.u-tokyo.ac.jp, Graduate School of Information Science and Technology, University of Tokyo, Tokyo, Japan
3. Saruwatari, Hiroshi (ORCID 0000-0003-0876-5617), hiroshi_saruwatari@ipc.i.u-tokyo.ac.jp, Graduate School of Information Science and Technology, University of Tokyo, Tokyo, Japan
CODEN ISPLEM
ContentType Journal Article
Copyright Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2021
Discipline Engineering
Genre Original research
Grant Information JSPS KAKENHI 17H06101; 19H01116; MIC/SCOPE #182103104
Open Access Yes
Peer Reviewed Yes
License CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/legalcode)
Subject Terms Context modeling; contextual embedding; Decoding; end-to-end text-to-speech synthesis; Incremental text-to-speech synthesis; language model; Linguistics; Predictive models; Speech; Speech recognition; Speech synthesis; Training; Tuning
URI https://ieeexplore.ieee.org/document/9406329
https://www.proquest.com/docview/2525839126