Neural Source-Filter Waveform Models for Statistical Parametric Speech Synthesis

Bibliographic Details
Published in	IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 28, pp. 402-415
Main Authors	Wang, Xin; Takaki, Shinji; Yamagishi, Junichi
Format	Journal Article
Language	English
Published	Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2020
Subjects	Acoustics; Autoregressive models; Autoregressive processes; Computational modeling; Convolution; Excitation; Fourier transforms; Machine learning; Mathematical model; Modules; neural network; Neural networks; short-time Fourier transform; Speech recognition; Speech synthesis; Training; Vocoders; waveform model; Waveforms
Summary:	Neural waveform models have demonstrated better performance than conventional vocoders for statistical parametric speech synthesis. One of the best models, called WaveNet, uses an autoregressive (AR) approach to model the distribution of waveform sampling points, but it has to generate a waveform in a time-consuming sequential manner. Some new models that use inverse-autoregressive flow (IAF) can generate a whole waveform in a one-shot manner but require either a larger amount of training time or a complicated model architecture plus a blend of training criteria. As an alternative to AR and IAF-based frameworks, we propose a neural source-filter (NSF) waveform modeling framework that is straightforward to train and fast to generate waveforms. This framework requires three components to generate waveforms: a source module that generates a sine-based signal as excitation, a non-AR dilated-convolution-based filter module that transforms the excitation into a waveform, and a conditional module that pre-processes the input acoustic features for the source and filter modules. This framework minimizes spectral-amplitude distances for model training, which can be efficiently implemented using short-time Fourier transform routines. As an initial NSF study, we designed three NSF models under the proposed framework and compared them with WaveNet using our deep learning toolkit. It was demonstrated that the NSF models generated waveforms at least 100 times faster than our WaveNet-vocoder, and the quality of the synthetic speech from the best NSF model was comparable to that from WaveNet on a large single-speaker Japanese speech corpus.
ISSN:	2329-9290 2329-9304
DOI:	10.1109/TASLP.2019.2956145
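
The summary names two pieces that are easy to illustrate in isolation: a source module that produces a sine-based excitation, and a training criterion based on spectral-amplitude distances computed from a short-time Fourier transform. The following is a minimal sketch of both, not the authors' implementation. All function names, parameter values (amplitude, noise level, frame length, hop size), and the specific squared log-power form of the distance are assumptions chosen for illustration. The neural filter and condition modules are omitted, and a naive DFT stands in for an FFT-based STFT routine so the example needs only the Python standard library.

```python
import cmath
import math
import random


def sine_excitation(f0, sample_rate=16000, amplitude=0.1, noise_std=0.003, seed=0):
    """Build a sine-based excitation from a per-sample F0 track in Hz.

    Voiced samples (f0 > 0) get a sine whose phase is the running sum of
    2*pi*f0/sample_rate, plus a little Gaussian noise. Unvoiced samples
    (f0 == 0) get Gaussian noise only. All constants are illustrative.
    """
    rng = random.Random(seed)
    phase = 0.0
    out = []
    for f in f0:
        if f > 0:
            phase += 2.0 * math.pi * f / sample_rate
            out.append(amplitude * math.sin(phase) + rng.gauss(0.0, noise_std))
        else:
            out.append(rng.gauss(0.0, amplitude / 3.0))
    return out


def stft_amplitude(signal, frame_len=64, hop=32):
    """Return per-frame spectral amplitudes (Hann window, naive DFT)."""
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(frame_len)]
        bins = []
        for k in range(frame_len // 2 + 1):  # non-negative frequencies only
            acc = sum(seg[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            bins.append(abs(acc))
        frames.append(bins)
    return frames


def log_spectral_amplitude_distance(generated, natural, frame_len=64, hop=32, eps=1e-5):
    """Mean squared difference between the log power spectra of two waveforms."""
    gen = stft_amplitude(generated, frame_len, hop)
    nat = stft_amplitude(natural, frame_len, hop)
    total, count = 0.0, 0
    for g_frame, n_frame in zip(gen, nat):
        for g, n in zip(g_frame, n_frame):
            total += (math.log(g * g + eps) - math.log(n * n + eps)) ** 2
            count += 1
    if count == 0:
        raise ValueError("signals are shorter than one analysis frame")
    return total / count


# 10 ms of unvoiced input followed by 30 ms voiced at 200 Hz (16 kHz sampling).
f0 = [0.0] * 160 + [200.0] * 480
excitation = sine_excitation(f0)
same = log_spectral_amplitude_distance(excitation, excitation)
scaled = log_spectral_amplitude_distance(excitation, [2.0 * x for x in excitation])
```

In this sketch the distance is zero when a waveform is compared with itself and positive when it is compared with a rescaled copy. In a full NSF-style model, the excitation would pass through the trainable filter module before the distance to the natural waveform is computed, typically with several frame-length and hop-size settings.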