SC-CNN: Effective Speaker Conditioning Method for Zero-Shot Multi-Speaker Text-to-Speech Systems
Published in | IEEE Signal Processing Letters, Vol. 30, pp. 1-5 |
---|---|
Main Authors | , , , , |
Format | Journal Article |
Language | English |
Published | New York, IEEE, 01.01.2023; The Institute of Electrical and Electronics Engineers, Inc. (IEEE) |
Summary: | This letter proposes an effective speaker-conditioning method that is applicable to zero-shot multi-speaker text-to-speech (ZSM-TTS) systems. Based on the inductive bias in the speech generation task, in which local context information in text/phoneme sequences heavily affects the speaker characteristics of the output speech, we propose a Speaker-Conditional Convolutional Neural Network (SC-CNN) for the ZSM-TTS task. SC-CNN first predicts convolutional kernels from each learned speaker embedding, then applies 1-D convolutions to phoneme sequences with the predicted kernels. It exploits this inductive bias and effectively models the characteristics of speech by providing speaker-specific local context in the phonetic domain. We also build both FastSpeech2-based and VITS-based ZSM-TTS systems to verify its superiority over conventional speaker-conditioning methods. The results confirm that the models with SC-CNN outperform recent ZSM-TTS models in terms of both subjective and objective measurements. |
---|---|
ISSN: | 1070-9908 1558-2361 |
DOI: | 10.1109/LSP.2023.3277786 |
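
The two steps named in the summary (predict convolution kernels from a speaker embedding, then convolve the phoneme sequence with them) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The linear kernel predictor, the depthwise (one kernel per channel) layout, the zero padding, the tensor sizes, and every name below (`predict_kernel`, `speaker_conditional_conv1d`, `proj`) are assumptions made for illustration only. As in most deep-learning code, the "convolution" is computed as a cross-correlation.

```python
import random

def predict_kernel(speaker_emb, proj, channels, width):
    """Map a speaker embedding to a (channels x width) kernel.

    Assumption: a single linear projection `proj` with
    channels * width rows, each as long as the embedding.
    """
    flat = [sum(w * e for w, e in zip(row, speaker_emb)) for row in proj]
    return [flat[c * width:(c + 1) * width] for c in range(channels)]

def speaker_conditional_conv1d(phonemes, kernel):
    """Depthwise 1-D convolution over time with zero 'same' padding.

    phonemes: list of T frames, each a list of C channel values.
    kernel:   C rows of odd width K (one row per channel).
    """
    n_frames = len(phonemes)
    n_channels = len(kernel)
    width = len(kernel[0])
    pad = width // 2
    out = []
    for t in range(n_frames):
        frame = []
        for c in range(n_channels):
            acc = 0.0
            for k in range(width):
                s = t + k - pad  # source time step for this tap
                if 0 <= s < n_frames:  # positions outside the sequence count as zero
                    acc += kernel[c][k] * phonemes[s][c]
            frame.append(acc)
        out.append(frame)
    return out

# Hypothetical sizes, chosen only to keep the demo small.
EMB_DIM, CHANNELS, WIDTH, FRAMES = 8, 4, 3, 6
rng = random.Random(0)

# Stand-ins for learned parameters and inputs (random here).
proj = [[rng.uniform(-1, 1) for _ in range(EMB_DIM)]
        for _ in range(CHANNELS * WIDTH)]
phonemes = [[rng.uniform(-1, 1) for _ in range(CHANNELS)]
            for _ in range(FRAMES)]
speaker_a = [rng.uniform(-1, 1) for _ in range(EMB_DIM)]
speaker_b = [rng.uniform(-1, 1) for _ in range(EMB_DIM)]

# The same phoneme sequence is filtered with speaker-specific kernels.
out_a = speaker_conditional_conv1d(
    phonemes, predict_kernel(speaker_a, proj, CHANNELS, WIDTH))
out_b = speaker_conditional_conv1d(
    phonemes, predict_kernel(speaker_b, proj, CHANNELS, WIDTH))
```

In this sketch the speaker identity enters only through the kernel weights, so each output frame mixes a phoneme with its immediate neighbours in a speaker-dependent way, which is the "speaker-specific local context" the summary refers to. A real system would presumably use full (not depthwise) kernels and a trained predictor inside a FastSpeech2 or VITS backbone; those details are not given in this record.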