Incremental Text-to-Speech Synthesis Using Pseudo Lookahead With Large Pretrained Language Model
Published in | IEEE signal processing letters Vol. 28; pp. 857 - 861 |
---|---|
Main Authors | Saeki, Takaaki; Takamichi, Shinnosuke; Saruwatari, Hiroshi |
Format | Journal Article |
Language | English |
Published | New York: IEEE; The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2021 |
ISSN | 1070-9908 (print); 1558-2361 (electronic) |
DOI | 10.1109/LSP.2021.3073869 |
Abstract | This letter presents an incremental text-to-speech (TTS) method that performs synthesis in small linguistic units while maintaining the naturalness of output speech. Incremental TTS is generally subject to a trade-off between latency and synthetic speech quality. It is challenging to produce high-quality speech with a low-latency setup that does not make much use of an unobserved future sentence (hereafter, "lookahead"). To resolve this issue, we propose an incremental TTS method that uses a pseudo lookahead generated with a language model to take the future contextual information into account without increasing latency. Our method can be regarded as imitating a human's incremental reading and uses pretrained GPT2, which accounts for the large-scale linguistic knowledge, for the lookahead generation. Evaluation results show that our method 1) achieves higher speech quality than the method taking only observed information into account and 2) achieves a speech quality equivalent to waiting for the future context observation. |
Author | Saeki, Takaaki; Takamichi, Shinnosuke; Saruwatari, Hiroshi |
Author_xml | 1. Saeki, Takaaki (ORCID: 0000-0001-6003-768X; email: takaaki_saeki@ipc.i.u-tokyo.ac.jp), Graduate School of Information Science and Technology, University of Tokyo, Tokyo, Japan; 2. Takamichi, Shinnosuke (ORCID: 0000-0003-0520-7847; email: shinnosuke_takamichi@ipc.i.u-tokyo.ac.jp), Graduate School of Information Science and Technology, University of Tokyo, Tokyo, Japan; 3. Saruwatari, Hiroshi (ORCID: 0000-0003-0876-5617; email: hiroshi_saruwatari@ipc.i.u-tokyo.ac.jp), Graduate School of Information Science and Technology, University of Tokyo, Tokyo, Japan |
CODEN | ISPLEM |
ContentType | Journal Article |
Copyright | Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2021 |
Discipline | Engineering |
EISSN | 1558-2361 |
EndPage | 861 |
Genre | orig-research |
GrantInformation_xml | Funder: JSPS KAKENHI. Grant IDs: 17H06101; 19H01116; MIC/SCOPE #182103104 |
IsDoiOpenAccess | true |
IsOpenAccess | true |
IsPeerReviewed | true |
IsScholarly | true |
License | https://creativecommons.org/licenses/by/4.0/legalcode |
ORCID | 0000-0003-0876-5617 0000-0003-0520-7847 0000-0001-6003-768X |
OpenAccessLink | https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/document/9406329 |
PageCount | 5 |
PublicationCentury | 2000 |
PublicationDate | 2021 |
PublicationDateYYYYMMDD | 2021-01-01 |
PublicationDecade | 2020 |
PublicationPlace | New York |
PublicationTitle | IEEE signal processing letters |
PublicationTitleAbbrev | LSP |
PublicationYear | 2021 |
Publisher | IEEE; The Institute of Electrical and Electronics Engineers, Inc. (IEEE) |
StartPage | 857 |
SubjectTerms | Context modeling; contextual embedding; Decoding; end-to-end text-to-speech synthesis; Incremental text-to-speech synthesis; language model; Linguistics; Predictive models; Speech; Speech recognition; Speech synthesis; Training; Tuning |
URI | https://ieeexplore.ieee.org/document/9406329 https://www.proquest.com/docview/2525839126 |
Volume | 28 |
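
The pseudo-lookahead idea summarized in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: a tiny hand-written bigram table stands in for the pretrained GPT2 model, no audio is produced, and the names `TOY_BIGRAMS`, `toy_lm_continue`, and `incremental_plan` are hypothetical. The sketch only shows the control flow, in which each small segment is paired with the words observed so far and with a predicted continuation, so that no waiting for real future words is needed.

```python
from typing import Callable, Dict, List

# ASSUMPTION: a toy greedy bigram table stands in for a pretrained
# language model such as GPT2. It is illustrative only.
TOY_BIGRAMS: Dict[str, str] = {
    "the": "quick",
    "quick": "brown",
    "brown": "fox",
    "fox": "jumps",
}


def toy_lm_continue(observed: List[str], n_words: int) -> List[str]:
    """Greedily predict up to n_words future words after the observed words."""
    predicted: List[str] = []
    prev = observed[-1] if observed else ""
    for _ in range(n_words):
        nxt = TOY_BIGRAMS.get(prev)
        if nxt is None:  # the toy model has no continuation for this word
            break
        predicted.append(nxt)
        prev = nxt
    return predicted


def incremental_plan(
    words: List[str],
    segment_len: int,
    lookahead_len: int,
    lm: Callable[[List[str], int], List[str]] = toy_lm_continue,
) -> List[dict]:
    """Build the per-segment inputs a hypothetical incremental TTS model would see.

    Each step contains the segment to synthesize, the already observed past,
    and a pseudo lookahead predicted from the observed text only.
    """
    steps: List[dict] = []
    for start in range(0, len(words), segment_len):
        end = min(start + segment_len, len(words))
        observed = words[:end]  # everything seen so far, including this segment
        steps.append(
            {
                "segment": words[start:end],
                "past": words[:start],
                "pseudo_lookahead": lm(observed, lookahead_len),
            }
        )
    return steps
```

Calling `incremental_plan("the quick brown fox jumps".split(), 2, 2)` yields three steps. The first pairs the segment `["the", "quick"]` with the predicted lookahead `["brown", "fox"]`, which here happens to match the true continuation; with a real language model the prediction would generally differ from the text that is eventually observed.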