A Deep Ensemble Learning Method for Monaural Speech Separation

Bibliographic Details
Published in: IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 24, No. 5, pp. 967–977
Main Authors: Zhang, Xiao-Lei; Wang, DeLiang
Format: Journal Article
Language: English
Published: United States: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.05.2016
Summary: Monaural speech separation is a fundamental problem in robust speech processing. Recently, deep neural network (DNN)-based speech separation methods, which predict either clean speech or an ideal time-frequency mask, have demonstrated remarkable performance improvement. However, a single DNN with a given window length does not leverage contextual information sufficiently, and the differences between the two optimization objectives are not well understood. In this paper, we propose a deep ensemble method, named multicontext networks, to address monaural speech separation. The first multicontext network averages the outputs of multiple DNNs whose inputs employ different window lengths. The second multicontext network is a stack of multiple DNNs. Each DNN in a module of the stack takes the concatenation of original acoustic features and expansion of the soft output of the lower module as its input, and predicts the ratio mask of the target speaker; the DNNs in the same module employ different contexts. We have conducted extensive experiments with three speech corpora. The results demonstrate the effectiveness of the proposed method. We have also compared the two optimization objectives systematically and found that predicting the ideal time-frequency mask is more efficient in utilizing clean training speech, while predicting clean speech is less sensitive to SNR variations.
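
To make the abstract's two schemes concrete, the sketch below shows one plausible way to wire them up in NumPy. This is a minimal illustration, not the authors' implementation: the window half-widths in CONTEXT_WIDTHS, the make_context helper, and the dnn callables (each standing in for a trained DNN that maps a windowed feature vector to per-frequency ratio-mask estimates in [0, 1]) are hypothetical placeholders chosen for this sketch.

import numpy as np

# Hypothetical window half-widths; the DNNs in a multicontext ensemble
# differ only in how many neighboring frames each one sees.
CONTEXT_WIDTHS = [1, 3, 5]

def make_context(features, width):
    # Concatenate each frame with `width` neighboring frames on each side
    # (edge-padded), turning a (T, D) matrix into (T, (2*width + 1) * D).
    T, _ = features.shape
    padded = np.pad(features, ((width, width), (0, 0)), mode="edge")
    return np.stack([padded[t:t + 2 * width + 1].ravel() for t in range(T)])

def multicontext_average(features, dnns):
    # Scheme 1: average the ratio masks predicted by DNNs whose inputs
    # use different window lengths.
    masks = [dnn(make_context(features, w))
             for w, dnn in zip(CONTEXT_WIDTHS, dnns)]
    return np.mean(masks, axis=0)

def multicontext_stack(features, modules):
    # Scheme 2: a stack of modules. Each DNN in a module takes the original
    # acoustic features concatenated with a context expansion of the lower
    # module's soft (mask) output; the module's predictions are averaged
    # before being passed upward.
    prev_mask = None
    for dnns in modules:
        outputs = []
        for w, dnn in zip(CONTEXT_WIDTHS, dnns):
            x = make_context(features, w)
            if prev_mask is not None:
                x = np.hstack([x, make_context(prev_mask, w)])
            outputs.append(dnn(x))
        prev_mask = np.mean(outputs, axis=0)
    return prev_mask

# Toy usage with stand-in "DNNs" (random sigmoid projections to 64
# time-frequency mask values), just to show the expected shapes.
rng = np.random.default_rng(0)

def toy_dnn(dim_in, dim_out=64):
    W = rng.standard_normal((dim_in, dim_out)) * 0.01
    return lambda x: 1.0 / (1.0 + np.exp(-(x @ W)))  # sigmoid keeps values in [0, 1]

feats = rng.standard_normal((100, 40))  # 100 frames of 40-dim acoustic features
dnns = [toy_dnn((2 * w + 1) * 40) for w in CONTEXT_WIDTHS]
mask = multicontext_average(feats, dnns)  # shape (100, 64), values in [0, 1]

Under this reading, scheme 1 reduces estimation variance by averaging across window lengths, while scheme 2 lets higher modules refine the soft mask estimates produced below them, matching the abstract's description of the stacking network.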
ISSN: 2329-9290 (print); 2329-9304 (electronic)
DOI: 10.1109/TASLP.2016.2536478