DNN based multi-speaker speech synthesis with temporal auxiliary speaker ID embedding

Bibliographic Details
Published in: 2019 International Conference on Electronics, Information, and Communication (ICEIC), pp. 1-4
Main Authors: Lee, Junmo; Song, Kwangsub; Noh, Kyoungjin; Park, Tae-Jun; Chang, Joon-Hyuk
Format: Conference Proceeding
Language: English
Published: Institute of Electronics and Information Engineers (IEIE), 01.01.2019
Summary: In this paper, multi-speaker speech synthesis using a speaker embedding is proposed. The proposed model is based on the Tacotron network, but its post-processing network is modified with dilated convolution layers, as used in the WaveNet architecture, to make it more adaptive to speech. The model can generate multiple speakers' voices with a single neural network by providing an auxiliary input, the speaker embedding, to the network. The model successfully generates two speakers' voices without significant deterioration in speech quality.
DOI:10.23919/ELINFOCOM.2019.8706390
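The conditioning mechanism the summary describes can be illustrated with a minimal NumPy sketch: a speaker ID selects a learned embedding, the embedding is broadcast over time and concatenated with the frame-level features, and the result passes through a stack of dilated 1-D convolutions in the style of a WaveNet post-net. All names, layer sizes, and dimensions below are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_SPEAKERS = 2   # the paper demonstrates two speakers
EMBED_DIM = 16     # assumed embedding size
FRAME_DIM = 80     # assumed mel-spectrogram channels
NUM_FRAMES = 100

# Learned lookup table of speaker embeddings (randomly initialised here).
speaker_table = rng.normal(size=(NUM_SPEAKERS, EMBED_DIM))

def dilated_conv1d(x, w, dilation):
    """Causal dilated 1-D convolution: x is (channels, time), w is (out_ch, in_ch, k)."""
    out_ch, in_ch, k = w.shape
    pad = (k - 1) * dilation
    x = np.pad(x, ((0, 0), (pad, 0)))  # left-pad so output length equals input length
    t = x.shape[1] - pad
    y = np.zeros((out_ch, t))
    for tap in range(k):
        y += w[:, :, tap] @ x[:, tap * dilation : tap * dilation + t]
    return y

def postnet(frames, speaker_id, weights, dilations):
    """Broadcast the speaker embedding over time, concatenate it with the
    frame features, then apply the stack of dilated convolutions."""
    emb = speaker_table[speaker_id]                         # (EMBED_DIM,)
    emb = np.repeat(emb[:, None], frames.shape[1], axis=1)  # (EMBED_DIM, time)
    x = np.concatenate([frames, emb], axis=0)
    for w, d in zip(weights, dilations):
        x = np.tanh(dilated_conv1d(x, w, d))
    return x

# Exponentially growing dilations, as in WaveNet-style stacks (sizes assumed).
dilations = [1, 2, 4]
channels = [FRAME_DIM + EMBED_DIM, 64, 64, FRAME_DIM]
weights = [rng.normal(scale=0.1, size=(channels[i + 1], channels[i], 2))
           for i in range(len(dilations))]

mel = rng.normal(size=(FRAME_DIM, NUM_FRAMES))
out = postnet(mel, speaker_id=1, weights=weights, dilations=dilations)
print(out.shape)  # (80, 100): same network, speaker chosen by the embedding
```

Because the speaker identity enters only through the auxiliary embedding, one set of convolution weights serves all speakers; switching `speaker_id` changes the output without changing the network, which is the "one model, multiple voices" property the abstract claims.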