UnitSpeech: Speaker-adaptive Speech Synthesis with Untranscribed Data

Bibliographic Details
Published in	arXiv.org
Main Authors	Kim, Heeseung; Kim, Sungwon; Yeom, Jiheum; Yoon, Sungroh
Format	Paper
Language	English
Published	Ithaca: Cornell University Library, arXiv.org, 28.06.2023
Subjects	Coders; Customization; Speech recognition

Summary:	We propose UnitSpeech, a speaker-adaptive speech synthesis method that fine-tunes a diffusion-based text-to-speech (TTS) model using minimal untranscribed data. To achieve this, we use the self-supervised unit representation as a pseudo transcript and integrate a unit encoder into the pre-trained TTS model. We train the unit encoder to provide speech content to the diffusion-based decoder, and then fine-tune the decoder for adaptation to the reference speaker using a single <unit, speech> pair. UnitSpeech performs speech synthesis tasks such as TTS and voice conversion (VC) in a personalized manner without requiring model re-training for each task. UnitSpeech achieves results comparable or superior to previous baselines on personalized TTS and any-to-any VC tasks. Our model also adapts well to real-world data and to other tasks that take a unit sequence as input.
ISSN:	2331-8422
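
The summary describes using a discrete unit sequence as a pseudo transcript but does not say how the units are obtained. The sketch below illustrates one common way such a sequence can be formed: each feature frame is assigned to its nearest centroid, and consecutive repeats are collapsed into (unit, duration) pairs. Everything here is an assumption made for illustration: the function names, the toy 2-D frames, and the centroids are hypothetical, and this is not the authors' pipeline.

```python
from itertools import groupby
from typing import List, Sequence, Tuple


def nearest_unit(frame: Sequence[float], centroids: Sequence[Sequence[float]]) -> int:
    """Return the index of the centroid closest to `frame` (squared Euclidean distance)."""
    def sq_dist(centroid: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(frame, centroid))

    return min(range(len(centroids)), key=lambda i: sq_dist(centroids[i]))


def frames_to_units(
    frames: Sequence[Sequence[float]], centroids: Sequence[Sequence[float]]
) -> List[int]:
    """Quantize every frame to a discrete unit id (hypothetical pseudo transcript)."""
    return [nearest_unit(frame, centroids) for frame in frames]


def collapse_units(units: Sequence[int]) -> List[Tuple[int, int]]:
    """Merge consecutive repeats into (unit, duration_in_frames) pairs."""
    return [(unit, len(list(run))) for unit, run in groupby(units)]


# Toy data (assumption): three centroids and five 2-D "feature frames".
centroids = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
frames = [[0.1, 0.0], [0.0, 0.1], [0.9, 1.1], [0.1, 0.9], [0.0, 1.0]]

units = frames_to_units(frames, centroids)
pairs = collapse_units(units)
```

In a real system the frames would come from a self-supervised speech model and the centroids from a clustering step fitted on a large corpus; the toy values above only show the shape of the data that a unit encoder would consume.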