Enhancement of Spatial Clustering-Based Time-Frequency Masks using LSTM Neural Networks
Main Authors | Grezes, Felix; Ni, Zhaoheng; Trinh, Viet Anh; Mandel, Michael |
---|---|
Format | Journal Article |
Language | English |
Published | 02.12.2020 |
Subjects | Computer Science - Learning; Computer Science - Sound |
Online Access | Get full text |
Abstract | Recent works have shown that Deep Recurrent Neural Networks using the LSTM
architecture can achieve strong single-channel speech enhancement by estimating
time-frequency masks. However, these models do not naturally generalize to
multi-channel inputs from varying microphone configurations. In contrast,
spatial clustering techniques can achieve such generalization but lack a strong
signal model. Our work proposes a combination of the two approaches. By using
LSTMs to enhance spatial clustering based time-frequency masks, we achieve both
the signal modeling performance of multiple single-channel LSTM-DNN speech
enhancers and the signal separation performance and generality of multi-channel
spatial clustering. We compare our proposed system to several baselines on the
CHiME-3 dataset. We evaluate the quality of the audio from each system using
SDR from the BSS\_eval toolkit and PESQ. We evaluate the intelligibility of the
output of each system using word error rate from a Kaldi automatic speech
recognizer. |
---|---|
Author | Grezes, Felix; Mandel, Michael; Trinh, Viet Anh; Ni, Zhaoheng |
BackLink | https://doi.org/10.48550/arXiv.2012.01576 (View paper in arXiv) |
ContentType | Journal Article |
Copyright | http://creativecommons.org/licenses/by/4.0 |
DBID | AKY GOX |
DOI | 10.48550/arxiv.2012.01576 |
DatabaseName | arXiv Computer Science; arXiv.org |
Database_xml | – sequence: 1 dbid: GOX name: arXiv.org url: http://arxiv.org/find sourceTypes: Open Access Repository |
DeliveryMethod | fulltext_linktorsrc |
ExternalDocumentID | 2012_01576 |
GroupedDBID | AKY GOX |
IEDL.DBID | GOX |
IngestDate | Mon Jan 08 05:39:08 EST 2024 |
IsDoiOpenAccess | true |
IsOpenAccess | true |
IsPeerReviewed | false |
IsScholarly | false |
Language | English |
LinkModel | DirectLink |
OpenAccessLink | https://arxiv.org/abs/2012.01576 |
ParticipantIDs | arxiv_primary_2012_01576 |
PublicationCentury | 2000 |
PublicationDate | 2020-12-02 |
PublicationDateYYYYMMDD | 2020-12-02 |
PublicationDate_xml | – month: 12 year: 2020 text: 2020-12-02 day: 02 |
PublicationDecade | 2020 |
PublicationYear | 2020 |
Score | 1.7940226 |
SecondaryResourceType | preprint |
SourceID | arxiv |
SourceType | Open Access Repository |
SubjectTerms | Computer Science - Learning; Computer Science - Sound |
Title | Enhancement of Spatial Clustering-Based Time-Frequency Masks using LSTM Neural Networks |
URI | https://arxiv.org/abs/2012.01576 |
hasFullText | 1 |
inHoldings | 1 |
link.rule.ids | 228,230,786,891 |
linkProvider | Cornell University |
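
The abstract reports intelligibility as word error rate (WER) from a Kaldi recognizer. As a rough illustration of what that metric measures, the sketch below computes WER as the word-level edit distance between a reference transcript and a recognizer hypothesis, divided by the number of reference words. It is not Kaldi's scoring code and is not taken from the paper; the function name `word_error_rate` and the whitespace tokenization are assumptions made for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Illustrative WER: (substitutions + deletions + insertions) / reference length.

    Hypothetical helper for this record; tokenizes on whitespace only.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" vs "sit") and one deletion ("the") over six words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference; lower values indicate more intelligible enhanced audio under this metric.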