Enhancing End-to-End Multi-Channel Speech Separation Via Spatial Feature Learning

Bibliographic Details
Published in	ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) pp. 7319 - 7323
Main Authors	Gu, Rongzhi, Zhang, Shi-Xiong, Chen, Lianwu, Xu, Yong, Yu, Meng, Su, Dan, Zou, Yuexian, Yu, Dong
Format	Conference Proceeding
Language	English
Published	IEEE 01.05.2020
Subjects	Computational modeling; Computer architecture; Convolution; end-to-end; inter-channel convolution differences; multi-channel speech separation; Signal to noise ratio; spatial features; Speech enhancement; Time-domain analysis; Two dimensional displays

Summary:	Hand-crafted spatial features (e.g., inter-channel phase difference, IPD) play a fundamental role in recent deep-learning-based multi-channel speech separation (MCSS) methods. However, these manually designed spatial features are hard to incorporate into an end-to-end optimized MCSS framework. In this work, we propose an integrated architecture for learning spatial features directly from multi-channel speech waveforms within an end-to-end speech separation framework. In this architecture, time-domain filters spanning signal channels are trained to perform adaptive spatial filtering. These filters are implemented by a 2D convolution (conv2d) layer, and their parameters are optimized using a speech separation objective function in a purely data-driven fashion. Furthermore, inspired by the IPD formulation, we design a conv2d kernel to compute the inter-channel convolution differences (ICDs), which are expected to provide spatial cues that help to distinguish directional sources. Evaluation results on a simulated multi-channel reverberant WSJ0 2-mix dataset demonstrate that the proposed ICD-based MCSS model improves the overall signal-to-distortion ratio by 10.4% over the IPD-based MCSS model.
ISSN:	2379-190X
DOI:	10.1109/ICASSP40776.2020.9053092
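
Illustrative sketch (not part of the catalog record): the summary describes ICDs only as the output of a conv2d kernel inspired by the IPD formulation. The NumPy code below shows one plausible reading, assuming the ICD for a microphone pair is the difference between the two channels after filtering both with the same time-domain kernels. The function names, parameters (`kernels`, `pair`, `hop`), and the random filter values are assumptions made for illustration; in the paper the filters are learned, and the exact formulation may differ.

```python
import numpy as np


def frame_signal(x, win, hop):
    """Slice a 1-D waveform into overlapping frames of shape (n_frames, win)."""
    n_frames = 1 + (len(x) - win) // hop
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]


def inter_channel_conv_diff(wave, kernels, pair, hop):
    """Hypothetical ICD for one microphone pair (an assumed formulation).

    wave:    (n_channels, n_samples) multi-channel waveform
    kernels: (n_filters, win) time-domain filters (random here, learned in practice)
    pair:    (m1, m2) microphone indices
    returns: (n_filters, n_frames)
    """
    win = kernels.shape[1]
    m1, m2 = pair
    f1 = frame_signal(wave[m1], win, hop)  # (n_frames, win)
    f2 = frame_signal(wave[m2], win, hop)
    # Filtering the frame difference is the same as applying a height-2
    # conv2d kernel whose two rows are [+k, -k] across the channel pair.
    return kernels @ (f1 - f2).T


# Usage: 2 channels, 160 samples, 4 filters of length 40, hop of 20 samples.
rng = np.random.default_rng(0)
wave = rng.standard_normal((2, 160))
kernels = rng.standard_normal((4, 40))
icd = inter_channel_conv_diff(wave, kernels, pair=(0, 1), hop=20)
```

Under this assumed form, two identical channels yield an all-zero ICD, so any nonzero response reflects a difference between the microphones, loosely analogous to how IPD measures a phase difference between channels.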