Boosting Contextual Information for Deep Neural Network Based Voice Activity Detection

Bibliographic Details
Published in	IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 24, no. 2, pp. 252-264
Main Authors	Zhang, Xiao-Lei; Wang, DeLiang
Format	Journal Article
Language	English
Published	Piscataway: IEEE, 01.02.2016; The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subjects	Acoustics; Audio signals; Classifiers; Cochleagram; deep neural network; ensemble learning; Frames; multi-resolution stacking; Neural networks; Noise; noise-independent training; Signal processing; Signal to noise ratio; Speech; Stacking; Training; Voice; voice activity detection
Summary:	Voice activity detection (VAD) is an important topic in audio signal processing. Contextual information is important for improving the performance of VAD at low signal-to-noise ratios. Here we explore contextual information by machine learning methods at three levels. At the top level, we employ an ensemble learning framework, named multi-resolution stacking (MRS), which is a stack of ensemble classifiers. Each classifier in a building block takes as input the concatenation of the predictions of its lower building blocks and the raw acoustic feature expanded by a given window (called a resolution). At the middle level, we describe a base classifier in MRS, named boosted deep neural network (bDNN). bDNN first generates multiple base predictions from different contexts of a single frame using only one DNN and then aggregates the base predictions into a better prediction for that frame. This differs from computationally expensive boosting methods, which train ensembles of classifiers to obtain multiple base predictions. At the bottom level, we employ the multi-resolution cochleagram feature, which incorporates contextual information by concatenating the cochleagram features at multiple spectrotemporal resolutions. Experimental results show that the MRS-based VAD outperforms other VADs by a considerable margin. Moreover, when trained on a large number of noise types and a wide range of signal-to-noise ratios, the MRS-based VAD demonstrates surprisingly good generalization performance on unseen test scenarios, approaching the performance of noise-dependent training.
ISSN:	2329-9290, 2329-9304
DOI:	10.1109/TASLP.2015.2505415
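
The bDNN aggregation step described in the summary (one network producing several base predictions for a frame from different contexts, which are then combined) can be illustrated with a minimal sketch. Several things here are assumptions not stated in the record: that aggregation is a plain average, that the network emits one speech score per frame in its input window, that the `offsets` list includes 0, and that 0.5 is the decision threshold. The function names `aggregate_window_predictions` and `decide_speech` are hypothetical.

```python
import numpy as np


def aggregate_window_predictions(window_preds, offsets):
    """Average the base predictions that different context windows give each frame.

    window_preds[t, k] is a (hypothetical) network's speech score for frame
    t + offsets[k], produced when the input window is centered on frame t.
    Averaging is an assumed aggregation rule, chosen only for illustration.
    """
    window_preds = np.asarray(window_preds, dtype=float)
    num_frames = window_preds.shape[0]
    totals = np.zeros(num_frames)
    counts = np.zeros(num_frames)
    for k, offset in enumerate(offsets):
        for t in range(num_frames):
            target = t + offset
            # Skip predictions that point outside the utterance.
            if 0 <= target < num_frames:
                totals[target] += window_preds[t, k]
                counts[target] += 1
    # Guard against division by zero for frames no window covers.
    return totals / np.maximum(counts, 1)


def decide_speech(scores, threshold=0.5):
    """Turn aggregated scores into hard speech/non-speech decisions (assumed threshold)."""
    return np.asarray(scores) > threshold


# Three frames, each window predicting the previous, current, and next frame.
example_preds = [
    [0.2, 0.4, 0.6],
    [0.1, 0.5, 0.9],
    [0.3, 0.3, 0.3],
]
example_scores = aggregate_window_predictions(example_preds, offsets=[-1, 0, 1])
example_decisions = decide_speech(example_scores)
```

In this toy input the middle frame receives three base predictions (one from each window position) while the edge frames receive two, which is why the function tracks a per-frame count rather than dividing by a fixed window size.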