Deep clustering: Discriminative embeddings for segmentation and separation

Bibliographic Details
Published in: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 31-35
Main Authors: Hershey, John R.; Chen, Zhuo; Le Roux, Jonathan; Watanabe, Shinji
Format: Conference Proceeding; Journal Article
Language: English
Published: IEEE, 01.03.2016

Summary: We address the problem of "cocktail-party" source separation in a deep learning framework called deep clustering. Previous deep network approaches to separation have shown promising performance in scenarios with a fixed number of sources, each belonging to a distinct signal class, such as speech and noise. However, for arbitrary source classes and number, "class-based" methods are not suitable. Instead, we train a deep network to assign contrastive embedding vectors to each time-frequency region of the spectrogram in order to implicitly predict the segmentation labels of the target spectrogram from the input mixtures. This yields a deep network-based analogue to spectral clustering, in that the embeddings form a low-rank pair-wise affinity matrix that approximates the ideal affinity matrix, while enabling much faster performance. At test time, the clustering step "decodes" the segmentation implicit in the embeddings by optimizing K-means with respect to the unknown assignments. Preliminary experiments on single-channel mixtures from multiple speakers show that a speaker-independent model trained on two-speaker mixtures can improve signal quality for mixtures of held-out speakers by an average of 6 dB. More dramatically, the same model does surprisingly well with three-speaker mixtures.
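As an illustration of the mechanics the summary describes, here is a minimal NumPy/scikit-learn sketch of the affinity-based training objective ||VV^T - YY^T||_F^2 and the K-means "decoding" step. The function names, shapes, and the low-rank expansion of the Frobenius-norm loss are our own assumptions for exposition, not code from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def deep_clustering_loss(V, Y):
    """||V V^T - Y Y^T||_F^2, computed in the low-rank form
    ||V^T V||_F^2 - 2 ||V^T Y||_F^2 + ||Y^T Y||_F^2 so the N x N
    affinity matrices (N = number of time-frequency bins) are never
    materialized.

    V: (N, D) unit-norm embedding vectors, one per time-frequency bin.
    Y: (N, C) one-hot ideal assignments (dominant source per bin).
    """
    return (np.linalg.norm(V.T @ V, ord="fro") ** 2
            - 2.0 * np.linalg.norm(V.T @ Y, ord="fro") ** 2
            + np.linalg.norm(Y.T @ Y, ord="fro") ** 2)

def decode_masks(V, n_sources=2):
    """Test-time decoding: K-means over the embeddings gives one cluster
    label per time-frequency bin, i.e. a binary mask per source."""
    labels = KMeans(n_clusters=n_sources, n_init=10).fit_predict(V)
    return np.stack([labels == c for c in range(n_sources)])
```

Because the loss only ever forms D x D and D x C Gram matrices, it scales linearly in the number of time-frequency bins; note also that `decode_masks` takes the source count as a test-time parameter, which is what lets a model trained on two-speaker mixtures be applied to three-speaker mixtures.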
ISSN: 2379-190X
DOI: 10.1109/ICASSP.2016.7471631