Complex Ratio Masking for Monaural Speech Separation

Bibliographic Details
Published in	IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 24, no. 3, pp. 483-492
Main Authors	Williamson, Donald S.; Wang, Yuxuan; Wang, DeLiang
Format	Journal Article
Language	English
Published	United States IEEE 01.03.2016 The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subjects	Fourier transforms; Neural networks; Noise measurement; Signal to noise ratio; Spectrogram; Speech; Speech enhancement; Time-frequency analysis; Speech separation; Deep neural networks; Speech quality; Complex ideal ratio mask
Summary:	Speech separation systems usually operate on the short-time Fourier transform (STFT) of noisy speech and enhance only the magnitude spectrum, leaving the phase spectrum unchanged. This practice stems from the belief that the phase spectrum is unimportant for speech enhancement. Recent studies, however, suggest that phase is important for perceptual quality, leading some researchers to consider enhancing both the magnitude and phase spectra. We present a supervised monaural speech separation approach that simultaneously enhances the magnitude and phase spectra by operating in the complex domain. Our approach uses a deep neural network to estimate the real and imaginary components of the ideal ratio mask defined in the complex domain. We report separation results for the proposed method and compare them to related systems. The proposed approach improves over other methods when evaluated with several objective metrics, including the perceptual evaluation of speech quality (PESQ), and in a listening test in which subjects prefer the proposed approach at a rate of at least 69%.
ISSN:	2329-9290 2329-9304
DOI:	10.1109/TASLP.2015.2512042
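
The summary describes a ratio mask defined in the complex domain, with separate real and imaginary components. As a rough illustration only, the sketch below computes such a mask as the complex quotient of a clean STFT by a noisy STFT and applies it by complex multiplication. It is not the authors' implementation: the function names, the `eps` stabilizer, and the random arrays standing in for STFT frames are all assumptions made for this example, and the deep neural network that would estimate the mask from noisy speech is not shown.

```python
import numpy as np


def complex_ratio_mask(clean_stft, noisy_stft, eps=1e-12):
    """Real and imaginary parts of M = S / Y for each time-frequency unit.

    Hypothetical helper: `eps` only guards against division by zero and is
    an assumption of this sketch, not something stated in the record.
    """
    y_r, y_i = noisy_stft.real, noisy_stft.imag
    s_r, s_i = clean_stft.real, clean_stft.imag
    denom = y_r ** 2 + y_i ** 2 + eps
    mask_real = (y_r * s_r + y_i * s_i) / denom
    mask_imag = (y_r * s_i - y_i * s_r) / denom
    return mask_real, mask_imag


def apply_complex_mask(mask_real, mask_imag, noisy_stft):
    """Complex multiplication of the mask with the noisy STFT.

    Because the product is complex, both magnitude and phase are modified.
    """
    return (mask_real + 1j * mask_imag) * noisy_stft


# Toy stand-ins for STFT matrices (frequency bins x frames). The STFT is
# linear, so the noisy spectrum is the sum of clean and noise spectra.
rng = np.random.default_rng(0)
shape = (5, 4)
clean = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
noise = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
noisy = clean + noise

m_r, m_i = complex_ratio_mask(clean, noisy)
recovered = apply_complex_mask(m_r, m_i, noisy)
```

With the ideal mask, `recovered` matches `clean` up to the small `eps` term. In a supervised system of the kind the summary describes, `m_r` and `m_i` would be the training targets and the network's estimates would take their place at test time.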