Incremental Text-to-Speech Synthesis Using Pseudo Lookahead With Large Pretrained Language Model
Published in | IEEE Signal Processing Letters, Vol. 28, pp. 857–861 |
---|---|
Main Authors | |
Format | Journal Article |
Language | English |
Published | New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2021 |
Subjects | |
ISSN | 1070-9908; 1558-2361 |
DOI | 10.1109/LSP.2021.3073869 |
Summary: This letter presents an incremental text-to-speech (TTS) method that performs synthesis in small linguistic units while maintaining the naturalness of the output speech. Incremental TTS is generally subject to a trade-off between latency and synthetic speech quality: it is challenging to produce high-quality speech in a low-latency setup that makes little use of the unobserved future part of the sentence (hereafter, "lookahead"). To resolve this issue, we propose an incremental TTS method that uses a pseudo lookahead generated with a language model to take future contextual information into account without increasing latency. Our method can be regarded as imitating a human's incremental reading, and it uses the pretrained GPT2, which captures large-scale linguistic knowledge, for lookahead generation. Evaluation results show that our method 1) achieves higher speech quality than a method that takes only observed information into account and 2) achieves speech quality equivalent to that obtained by waiting for the future context to be observed.