Dynamic Chunk Convolution for Unified Streaming and Non-Streaming Conformer ASR
Published in | arXiv.org |
---|---|
Main Authors | Li, Xilai; Huybrechts, Goeric; Ronanki, Srikanth; Farris, Jeff; Bodapati, Sravan |
Format | Paper |
Language | English |
Published | Ithaca; Cornell University Library, arXiv.org; 25.04.2023 |
Subjects | Convolution; Datasets; Degradation; Speech recognition |
Online Access | Get full text |
Abstract | Interest in unifying streaming and non-streaming speech recognition models has grown recently, as a way to reduce development, training and deployment costs. The best-known approaches rely on either window-based or dynamic chunk-based attention strategies, combined with causal convolutions, to minimize the degradation due to streaming. However, the performance gap between non-streaming and a full-contextual model trained independently remains relatively large. To address this, we propose a dynamic chunk-based convolution that replaces the causal convolution in a hybrid Connectionist Temporal Classification (CTC)-Attention Conformer architecture. Additionally, we demonstrate further improvements by initializing weights from a full-contextual model and by parallelizing the convolution and self-attention modules. We evaluate our models on the open-source VoxPopuli and LibriSpeech datasets and on in-house conversational datasets. Overall, our proposed model reduces the degradation of the streaming mode relative to the non-streaming full-contextual model from 41.7% and 45.7% to 16.7% and 26.2% on the LibriSpeech test-clean and test-other datasets respectively, while improving word error rate (WER) by a relative 15.5% over the previous state-of-the-art unified model. |
---|---|
Author | Li, Xilai; Bodapati, Sravan; Ronanki, Srikanth; Farris, Jeff; Huybrechts, Goeric |
Author_xml | – sequence: 1 givenname: Xilai surname: Li fullname: Li, Xilai – sequence: 2 givenname: Goeric surname: Huybrechts fullname: Huybrechts, Goeric – sequence: 3 givenname: Srikanth surname: Ronanki fullname: Ronanki, Srikanth – sequence: 4 givenname: Jeff surname: Farris fullname: Farris, Jeff – sequence: 5 givenname: Sravan surname: Bodapati fullname: Bodapati, Sravan |
ContentType | Paper |
Copyright | 2023. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License. |
DBID | 8FE 8FG ABJCF ABUWG AFKRA AZQEC BENPR BGLVJ CCPQU DWQXO HCIFZ L6V M7S PIMPY PQEST PQQKQ PQUKI PRINS PTHSS |
DatabaseName | ProQuest SciTech Collection ProQuest Technology Collection Materials Science & Engineering Collection ProQuest Central (Alumni) ProQuest Central UK/Ireland ProQuest Central Essentials AUTh Library subscriptions: ProQuest Central Technology Collection ProQuest One Community College ProQuest Central SciTech Premium Collection (Proquest) (PQ_SDU_P3) ProQuest Engineering Collection ProQuest Engineering Database Publicly Available Content Database ProQuest One Academic Eastern Edition (DO NOT USE) ProQuest One Academic ProQuest One Academic UKI Edition ProQuest Central China Engineering Collection |
DatabaseTitle | Publicly Available Content Database Engineering Database Technology Collection ProQuest Central Essentials ProQuest One Academic Eastern Edition ProQuest Central (Alumni Edition) SciTech Premium Collection ProQuest One Community College ProQuest Technology Collection ProQuest SciTech Collection ProQuest Central China ProQuest Central ProQuest Engineering Collection ProQuest One Academic UKI Edition ProQuest Central Korea Materials Science & Engineering Collection ProQuest One Academic Engineering Collection |
DatabaseTitleList | Publicly Available Content Database |
Database_xml | – sequence: 1 dbid: 8FG name: ProQuest Technology Collection url: https://search.proquest.com/technologycollection1 sourceTypes: Aggregation Database |
DeliveryMethod | fulltext_linktorsrc |
Discipline | Physics |
EISSN | 2331-8422 |
Genre | Working Paper/Pre-Print |
GroupedDBID | 8FE 8FG ABJCF ABUWG AFKRA ALMA_UNASSIGNED_HOLDINGS AZQEC BENPR BGLVJ CCPQU DWQXO FRJ HCIFZ L6V M7S M~E PIMPY PQEST PQQKQ PQUKI PRINS PTHSS |
ID | FETCH-proquest_journals_28036738423 |
IEDL.DBID | 8FG |
IngestDate | Thu Oct 10 19:27:58 EDT 2024 |
IsOpenAccess | true |
IsPeerReviewed | false |
IsScholarly | false |
Language | English |
LinkModel | DirectLink |
MergedId | FETCHMERGED-proquest_journals_28036738423 |
OpenAccessLink | https://www.proquest.com/docview/2803673842?pq-origsite=%requestingapplication% |
PQID | 2803673842 |
PQPubID | 2050157 |
ParticipantIDs | proquest_journals_2803673842 |
PublicationCentury | 2000 |
PublicationDate | 20230425 |
PublicationDateYYYYMMDD | 2023-04-25 |
PublicationDate_xml | – month: 04 year: 2023 text: 20230425 day: 25 |
PublicationDecade | 2020 |
PublicationPlace | Ithaca |
PublicationPlace_xml | – name: Ithaca |
PublicationTitle | arXiv.org |
PublicationYear | 2023 |
Publisher | Cornell University Library, arXiv.org |
Publisher_xml | – name: Cornell University Library, arXiv.org |
SSID | ssj0002672553 |
Score | 3.4594846 |
SecondaryResourceType | preprint |
SourceID | proquest |
SourceType | Aggregation Database |
SubjectTerms | Convolution; Datasets; Degradation; Speech recognition |
Title | Dynamic Chunk Convolution for Unified Streaming and Non-Streaming Conformer ASR |
URI | https://www.proquest.com/docview/2803673842 |
hasFullText | 1 |
inHoldings | 1 |
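
The abstract names a dynamic chunk-based convolution as the replacement for causal convolution but does not describe its mechanics. The sketch below illustrates one plausible reading, stated here as an assumption and not as the authors' implementation: a symmetric 1-D convolution whose taps may reach any past frame but are cut off at the right edge of the current chunk, so no output depends on frames from a later chunk. The function name `chunk_conv1d`, its parameters, and the single-channel list representation are all hypothetical choices made for illustration.

```python
def chunk_conv1d(x, kernel, chunk_size):
    """Symmetric 1-D convolution that never looks past the current chunk's right edge.

    x          -- list of floats, one feature channel over time (illustrative layout)
    kernel     -- list of odd length, centred on the current frame
    chunk_size -- frames per chunk; any value >= len(x) gives full context
    """
    if len(kernel) % 2 == 0:
        raise ValueError("kernel length must be odd")
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")

    half = len(kernel) // 2
    n = len(x)
    out = []
    for t in range(n):
        # Exclusive index of the last frame visible from t: the end of t's chunk.
        right_edge = min(n, (t // chunk_size + 1) * chunk_size)
        acc = 0.0
        for k, w in enumerate(kernel):
            s = t + k - half  # source frame read by this kernel tap
            # Past frames are always allowed; future frames only inside the chunk.
            if 0 <= s < right_edge:
                acc += w * x[s]
        out.append(acc)
    return out


frames = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
taps = [1.0, 1.0, 1.0]

streaming = chunk_conv1d(frames, taps, chunk_size=2)
full_context = chunk_conv1d(frames, taps, chunk_size=len(frames))
```

Under this assumed masking, a single set of kernel weights serves both modes: a small `chunk_size` mimics streaming, and a chunk spanning the whole utterance recovers an ordinary non-causal convolution. The word "dynamic" suggests the chunk size varies during training; how it is chosen is not stated in the abstract.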