Enhancement of Spatial Clustering-Based Time-Frequency Masks using LSTM Neural Networks


Bibliographic Details
Main Authors Grezes, Felix; Ni, Zhaoheng; Trinh, Viet Anh; Mandel, Michael
Format Journal Article
Language English
Published 2020-12-02

Abstract Recent works have shown that Deep Recurrent Neural Networks using the LSTM architecture can achieve strong single-channel speech enhancement by estimating time-frequency masks. However, these models do not naturally generalize to multi-channel inputs from varying microphone configurations. In contrast, spatial clustering techniques can achieve such generalization but lack a strong signal model. Our work proposes a combination of the two approaches. By using LSTMs to enhance spatial-clustering-based time-frequency masks, we achieve both the signal modeling performance of multiple single-channel LSTM-DNN speech enhancers and the signal separation performance and generality of multi-channel spatial clustering. We compare our proposed system to several baselines on the CHiME-3 dataset. We evaluate the quality of the audio from each system using SDR from the BSS_eval toolkit and PESQ. We evaluate the intelligibility of the output of each system using word error rate from a Kaldi automatic speech recognizer.
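
The abstract refers to time-frequency masks and to SDR as a quality measure. As a rough illustration only, the sketch below applies a mask to a tiny grid of time-frequency bins and computes a simplified signal-to-distortion ratio. The function names `apply_mask` and `simple_sdr_db` are hypothetical, the numbers are made up, and the ratio is a plain energy ratio rather than the BSS_eval decomposition the authors use; none of this is taken from the paper itself.

```python
import math


def apply_mask(mixture, mask):
    """Scale each time-frequency bin of `mixture` by the matching mask value.

    `mixture` is a list of frames, each a list of complex bins;
    `mask` has the same shape, with real values in [0, 1].
    """
    return [
        [m * x for m, x in zip(mask_row, mix_row)]
        for mask_row, mix_row in zip(mask, mixture)
    ]


def simple_sdr_db(reference, estimate):
    """Energy ratio between a reference signal and the estimation error, in dB.

    Simplified stand-in (assumption): BSS_eval further splits the error
    into interference and artifact terms, which is not done here.
    """
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if error_energy == 0.0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(signal_energy / error_energy)


# Toy data: two frames, two frequency bins each (values are illustrative).
mixture = [[1 + 1j, 2 + 0j], [0 + 2j, 4 + 0j]]
mask = [[1.0, 0.5], [0.0, 0.25]]
masked = apply_mask(mixture, mask)

# Toy time-domain signals for the simplified SDR.
reference = [1.0, 0.0, -1.0, 0.0]
estimate = [0.9, 0.0, -0.9, 0.0]
sdr = simple_sdr_db(reference, estimate)
```

A mask value near 1 keeps a bin and a value near 0 suppresses it, so the quality of the mask estimate largely determines how much of the target speech survives in the output.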
Subjects Computer Science - Learning; Computer Science - Sound
DOI 10.48550/arxiv.2012.01576
Online Access https://arxiv.org/abs/2012.01576
Copyright http://creativecommons.org/licenses/by/4.0
Source arXiv.org (open access repository; preprint, not peer reviewed)