UnitSpeech: Speaker-adaptive Speech Synthesis with Untranscribed Data

Bibliographic Details

Published in: arXiv.org
Main Authors: Kim, Heeseung; Kim, Sungwon; Yeom, Jiheum; Yoon, Sungroh
Format: Paper
Language: English
Published: Ithaca: Cornell University Library, arXiv.org, 28.06.2023

Summary: We propose UnitSpeech, a speaker-adaptive speech synthesis method that fine-tunes a diffusion-based text-to-speech (TTS) model using minimal untranscribed data. To achieve this, we use the self-supervised unit representation as a pseudo transcript and integrate the unit encoder into the pre-trained TTS model. We train the unit encoder to provide speech content to the diffusion-based decoder and then fine-tune the decoder for speaker adaptation to the reference speaker using a single <unit, speech> pair. UnitSpeech performs speech synthesis tasks such as TTS and voice conversion (VC) in a personalized manner without requiring model re-training for each task. UnitSpeech achieves results comparable or superior to previous baselines on personalized TTS and any-to-any VC tasks. Our model also shows widespread adaptive performance on real-world data and other tasks that use a unit sequence as input.
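
The summary describes the adaptation recipe at a high level: discrete self-supervised units stand in for a transcript, a unit encoder feeds that content to a diffusion decoder, and only the decoder is fine-tuned on a single <unit, speech> pair. The sketch below illustrates that loop in PyTorch-style code; every module name, shape, noise schedule, and hyperparameter here is an illustrative assumption, not the paper's implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

N_UNITS, UNIT_DIM, MEL_DIM = 1000, 192, 80  # assumed vocabulary / feature sizes

class UnitEncoder(nn.Module):
    # Embeds discrete unit IDs (the pseudo transcript) into frame-level content features.
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(N_UNITS, UNIT_DIM)
        self.proj = nn.Conv1d(UNIT_DIM, MEL_DIM, kernel_size=3, padding=1)

    def forward(self, units):                  # units: (B, T) int64
        x = self.embed(units).transpose(1, 2)  # (B, UNIT_DIM, T)
        return self.proj(x)                    # (B, MEL_DIM, T)

class DiffusionDecoder(nn.Module):
    # Toy noise-prediction network conditioned on content features and the diffusion time.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(2 * MEL_DIM + 1, 256, kernel_size=3, padding=1), nn.GELU(),
            nn.Conv1d(256, MEL_DIM, kernel_size=3, padding=1),
        )

    def forward(self, noisy_mel, cond, t):     # noisy_mel, cond: (B, MEL_DIM, T); t: (B,)
        t_map = t.view(-1, 1, 1).expand(-1, 1, noisy_mel.size(-1))
        return self.net(torch.cat([noisy_mel, cond, t_map], dim=1))

# A single reference <unit, speech> pair; random tensors stand in for real
# discrete self-supervised units and the reference speaker's mel-spectrogram.
units = torch.randint(0, N_UNITS, (1, 120))
mel = torch.randn(1, MEL_DIM, 120)

encoder, decoder = UnitEncoder(), DiffusionDecoder()
for p in encoder.parameters():                 # adaptation touches only the decoder
    p.requires_grad_(False)
opt = torch.optim.Adam(decoder.parameters(), lr=2e-5)

for step in range(200):
    t = torch.rand(1)                          # diffusion time in (0, 1)
    noise = torch.randn_like(mel)
    alpha = (1.0 - t).view(-1, 1, 1)           # crude variance-preserving corruption
    noisy_mel = alpha.sqrt() * mel + (1.0 - alpha).sqrt() * noise
    pred = decoder(noisy_mel, encoder(units), t)
    loss = F.mse_loss(pred, noise)             # standard noise-prediction objective
    opt.zero_grad()
    loss.backward()
    opt.step()

The design point mirrored here is that speaker identity is absorbed into the fine-tuned decoder weights while the frozen unit encoder keeps supplying content, so no transcript of the reference speech is needed.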
ISSN: 2331-8422