Latent source-specific generative factor learning for monaural speech separation using weighted-factor autoencoder

Bibliographic Details
Published in: Frontiers of Information Technology & Electronic Engineering, Vol. 21, No. 11, pp. 1639-1650
Main Authors: Chen, Jing-jing; Mao, Qi-rong; Qin, You-cai; Qian, Shuang-qing; Zheng, Zhi-shen
Format: Journal Article
Language: English
Published: Hangzhou: Zhejiang University Press / Springer Nature B.V., 01.11.2020
Affiliations: School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang 212013, China
Jiangsu Key Laboratory of Security Technology for Industrial Cyberspace, Zhenjiang 212013, China
ISSN: 2095-9184, 2095-9230
DOI: 10.1631/FITEE.2000019

More Information
Summary: Much recent progress in monaural speech separation (MSS) has been achieved through a series of deep learning architectures based on autoencoders, which use an encoder to condense the input signal into compressed features and then feed these features into a decoder to construct a specific audio source of interest. However, these approaches can neither learn the generative factors of the original input for MSS nor construct every audio source in the mixed speech. In this study, we propose a novel weighted-factor autoencoder (WFAE) model for MSS, which introduces a regularization loss into the objective function so that each constructed source is isolated from, rather than contaminated by, the other sources. By incorporating a latent attention mechanism and a supervised source constructor in the separation layer, WFAE learns source-specific generative factors and a set of discriminative features for each source, leading to improved MSS performance. Experiments on benchmark datasets show that our approach outperforms existing methods. On three important metrics, WFAE performs strongly in a relatively challenging MSS setting, namely speaker-independent MSS.
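
For readers who want a concrete picture of the architecture outlined in the summary, the following is a minimal PyTorch sketch of the idea: a shared encoder produces latent generative factors, a per-source attention weighting selects source-specific factors, and one supervised constructor (decoder) per source rebuilds that source, with a simple cross-source regularizer added to the reconstruction loss. The layer sizes, the magnitude-spectrogram input, and the exact form of the regularization term are illustrative assumptions, not the configuration published in the paper.

# Minimal sketch of a weighted-factor autoencoder for two-speaker monaural
# separation. Layer sizes, spectrogram inputs, and the loss terms are
# illustrative assumptions, not the authors' published configuration.
import torch
import torch.nn as nn


class WFAESketch(nn.Module):
    def __init__(self, n_bins=129, latent_dim=64, n_sources=2):
        super().__init__()
        self.n_sources = n_sources
        # Shared encoder: compresses a mixture frame into latent generative factors.
        self.encoder = nn.Sequential(
            nn.Linear(n_bins, 256), nn.ReLU(),
            nn.Linear(256, latent_dim), nn.ReLU(),
        )
        # Latent attention: one weighting per source over the latent factors,
        # so each source attends to its own subset of generative factors.
        self.attention = nn.ModuleList(
            [nn.Linear(latent_dim, latent_dim) for _ in range(n_sources)]
        )
        # Supervised source constructors: one decoder per source.
        self.decoders = nn.ModuleList(
            [nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                           nn.Linear(256, n_bins), nn.ReLU())
             for _ in range(n_sources)]
        )

    def forward(self, mixture):
        z = self.encoder(mixture)                 # shared latent factors
        outputs = []
        for attend, decode in zip(self.attention, self.decoders):
            w = torch.sigmoid(attend(z))          # source-specific factor weights
            outputs.append(decode(w * z))         # reconstruct one source
        return torch.stack(outputs, dim=1)        # (batch, n_sources, n_bins)


def separation_loss(estimates, targets, reg_weight=0.1):
    """Reconstruction loss plus a simple cross-source regularizer that
    penalizes the energy one estimate shares with the other source's
    reference (an illustrative stand-in for the paper's regularization)."""
    recon = ((estimates - targets) ** 2).mean()
    swapped = targets.flip(dims=[1])              # pair each estimate with the other source
    leakage = (estimates * swapped).abs().mean()
    return recon + reg_weight * leakage


if __name__ == "__main__":
    model = WFAESketch()
    mix = torch.rand(8, 129)                      # batch of mixture magnitude frames
    refs = torch.rand(8, 2, 129)                  # reference spectra for both speakers
    loss = separation_loss(model(mix), refs)
    loss.backward()
    print(float(loss))

In this sketch the attention weights multiply the shared latent vector elementwise before each decoder, which is one simple way to realize "source-specific generative factors"; the regularizer discourages an estimate from carrying energy that belongs to the other source.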