Voice conversion using deep Bidirectional Long Short-Term Memory based Recurrent Neural Networks

Bibliographic Details
Published in: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4869-4873
Main Authors: Lifa Sun, Shiyin Kang, Kun Li, Helen Meng
Format: Conference Proceeding
Language: English
Published: IEEE, 01.04.2015
Summary: This paper investigates the use of Deep Bidirectional Long Short-Term Memory based Recurrent Neural Networks (DBLSTM-RNNs) for voice conversion. Frame-based methods using conventional Deep Neural Networks (DNNs) do not directly model temporal correlations across speech frames, which limits the quality of the converted speech. To improve the naturalness and continuity of the converted speech, we propose a sequence-based conversion method using DBLSTM-RNNs that models not only the frame-wise relationship between the source and target voices, but also the long-range context dependencies in the acoustic trajectory. Experiments show that DBLSTM-RNNs outperform DNNs, achieving Mean Opinion Scores of 3.2 and 2.3, respectively. Moreover, DBLSTM-RNNs without dynamic features still outperform DNNs with dynamic features.
ISSN: 1520-6149
DOI: 10.1109/ICASSP.2015.7178896
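
As a rough illustration of the sequence-based conversion idea described in the summary, the PyTorch sketch below maps time-aligned source spectral frames to target frames with a stacked bidirectional LSTM trained under a mean-squared-error loss. It is a minimal sketch only: the class name, layer sizes, learning rate, and the 40-dimensional feature assumption are hypothetical and do not reproduce the authors' exact configuration.

import torch
import torch.nn as nn

class DBLSTMConverter(nn.Module):
    # Stacked bidirectional LSTM that maps a source speaker's spectral
    # trajectory to the target speaker's, frame by frame, while seeing
    # the full utterance context in both time directions.
    # feat_dim=40 is an assumed mel-cepstral dimension, not the paper's.
    def __init__(self, feat_dim=40, hidden=256, num_layers=2):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, num_layers=num_layers,
                             batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, feat_dim)  # 2x: both directions

    def forward(self, src):            # src: (batch, frames, feat_dim)
        h, _ = self.blstm(src)         # h: (batch, frames, 2 * hidden)
        return self.proj(h)            # converted frames, same length

# Minimal training step on dummy frame-aligned source/target utterances.
model = DBLSTMConverter()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
src = torch.randn(4, 200, 40)          # 4 utterances, 200 frames each
tgt = torch.randn(4, 200, 40)          # frame-aligned target features
loss = nn.functional.mse_loss(model(src), tgt)
loss.backward()
optimizer.step()

Because each bidirectional layer reads the whole utterance in both directions, the network can capture long-range acoustic-trajectory context directly, which is consistent with the summary's finding that such a model can do without appended dynamic (delta) features.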