Enhancement of Spatial Clustering-Based Time-Frequency Masks using LSTM Neural Networks

Bibliographic Details
Main Authors	Grezes, Felix; Ni, Zhaoheng; Trinh, Viet Anh; Mandel, Michael
Format	Journal Article
Language	English
Published	2020-12-02
Subjects	Computer Science - Learning; Computer Science - Sound
Abstract	Recent work has shown that deep recurrent neural networks using the LSTM architecture can achieve strong single-channel speech enhancement by estimating time-frequency masks. However, these models do not naturally generalize to multi-channel inputs from varying microphone configurations. In contrast, spatial clustering techniques can achieve such generalization but lack a strong signal model. Our work proposes a combination of the two approaches. By using LSTMs to enhance spatial-clustering-based time-frequency masks, we achieve both the signal modeling performance of multiple single-channel LSTM-DNN speech enhancers and the signal separation performance and generality of multi-channel spatial clustering. We compare our proposed system to several baselines on the CHiME-3 dataset. We evaluate the audio quality of each system's output using SDR from the BSS_eval toolkit and PESQ, and its intelligibility using the word error rate of a Kaldi automatic speech recognizer.
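
For readers unfamiliar with time-frequency masking, the following is a minimal sketch of the general idea, not the authors' system. It assumes complex STFT arrays of shape (frequency bins, frames) and uses an oracle ratio mask as a stand-in for a mask that would, in the paper's setting, come from spatial clustering and an LSTM. The function names (`ideal_ratio_mask`, `apply_mask`, `simple_sdr_db`) are hypothetical, the signals are synthetic, and the SDR here is a plain energy ratio rather than the BSS_eval decomposition.

```python
import numpy as np


def ideal_ratio_mask(speech_stft, noise_stft, eps=1e-8):
    """Oracle mask: fraction of each bin's power that belongs to speech."""
    speech_power = np.abs(speech_stft) ** 2
    noise_power = np.abs(noise_stft) ** 2
    return speech_power / (speech_power + noise_power + eps)


def apply_mask(mixture_stft, mask):
    """Scale every time-frequency bin of the mixture by a mask in [0, 1]."""
    return np.clip(mask, 0.0, 1.0) * mixture_stft


def simple_sdr_db(reference, estimate, eps=1e-8):
    """Energy ratio of reference to residual, in dB (simplified, not BSS_eval)."""
    residual = reference - estimate
    ref_energy = np.sum(np.abs(reference) ** 2)
    err_energy = np.sum(np.abs(residual) ** 2)
    return 10.0 * np.log10((ref_energy + eps) / (err_energy + eps))


# Synthetic complex "spectrograms": 257 frequency bins, 100 frames (assumed sizes).
rng = np.random.default_rng(0)
shape = (257, 100)
speech = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
noise = 0.5 * (rng.standard_normal(shape) + 1j * rng.standard_normal(shape))
mixture = speech + noise

mask = ideal_ratio_mask(speech, noise)
enhanced = apply_mask(mixture, mask)

sdr_before = simple_sdr_db(speech, mixture)
sdr_after = simple_sdr_db(speech, enhanced)
print(f"SDR before masking: {sdr_before:.2f} dB, after: {sdr_after:.2f} dB")
```

In a real pipeline the mask would be estimated from the noisy input alone, and the masked STFT would be inverted back to a waveform before scoring with BSS_eval, PESQ, or a speech recognizer.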
Copyright	http://creativecommons.org/licenses/by/4.0
DOI	10.48550/arxiv.2012.01576
OpenAccessLink	https://arxiv.org/abs/2012.01576