Representation Mixing for TTS Synthesis

Bibliographic Details
Published in	ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5906-5910
Main Authors	Kastner, Kyle; Santos, Joao Felipe; Bengio, Yoshua; Courville, Aaron
Format	Conference Proceeding
Language	English
Published	IEEE 01.05.2019
Subjects	attention; Computer architecture; Decoding; deep learning; Linguistics; Pipelines; recurrent neural network; sequence-to-sequence learning; Spectrogram; Text-to-speech; Training; Transforms
Summary:	Recent character and phoneme-based parametric TTS systems using deep learning have shown strong performance in natural speech generation. However, the choice between character or phoneme input can create serious limitations for practical deployment, as direct control of pronunciation is crucial in certain cases. We demonstrate a simple method for combining multiple types of linguistic information in a single encoder, named representation mixing, enabling flexible choice between character, phoneme, or mixed representations during inference. Experiments and user studies on a public audiobook corpus show the efficacy of our approach.
ISSN:	2379-190X
DOI:	10.1109/ICASSP.2019.8682880
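
Illustration:	The summary describes feeding character, phoneme, or mixed input to a single encoder. The sketch below is one possible reading of that idea, not the authors' implementation: it chooses a representation independently for each word and returns a parallel type mask that an encoder could use to tell the two symbol sets apart. The lexicon `TOY_LEXICON`, the function `mix_representations`, the `p_phoneme` parameter, the per-word granularity, and the mask encoding are all assumptions introduced here for illustration; none of them is taken from this record.

```python
import random

# Hypothetical toy pronunciation lexicon (ARPAbet-style symbols).
# Assumption for illustration only; not taken from the paper.
TOY_LEXICON = {
    "the": ["DH", "AH"],
    "cat": ["K", "AE", "T"],
    "sat": ["S", "AE", "T"],
}

CHAR, PHONE = 0, 1  # per-token type flags (assumed encoding)


def mix_representations(words, p_phoneme=0.5, rng=None):
    """Pick characters or phonemes independently for each word.

    Returns (tokens, type_mask): a flat token list and a parallel list of
    CHAR/PHONE flags that an encoder could use to select an embedding table.
    Words missing from the lexicon always fall back to characters.
    """
    rng = rng or random.Random()
    tokens, type_mask = [], []
    for i, word in enumerate(words):
        if i > 0:
            # Word separator, treated here as a character-type token.
            tokens.append(" ")
            type_mask.append(CHAR)
        phones = TOY_LEXICON.get(word.lower())
        use_phones = phones is not None and rng.random() < p_phoneme
        symbols = phones if use_phones else list(word)
        tokens.extend(symbols)
        type_mask.extend([PHONE if use_phones else CHAR] * len(symbols))
    return tokens, type_mask
```

Under these assumptions, setting `p_phoneme` to 0.0 or 1.0 yields pure character or pure phoneme input (for words in the lexicon), while intermediate values yield mixed sequences, which mirrors the flexible choice at inference time that the summary mentions.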