Joint Unsupervised and Supervised Training for Multilingual ASR
Published in | arXiv.org
Main Authors | Bai, Junwen; Li, Bo; Zhang, Yu; Bapna, Ankur; Siddhartha, Nikhil; Khe Chai Sim; Sainath, Tara N
Format | Paper (Working Paper/Pre-Print)
Language | English
Published | Ithaca: Cornell University Library, arXiv.org, 15.11.2021
EISSN | 2331-8422
Subjects | Languages; Multilingualism; Speech recognition; Training
Online Access | Get full text: https://www.proquest.com/docview/2598302370
Abstract | Self-supervised training has shown promising gains in pretraining models and facilitating the downstream finetuning for speech recognition, like multilingual ASR. Most existing methods adopt a 2-stage scheme where the self-supervised loss is optimized in the first pretraining stage, and the standard supervised finetuning resumes in the second stage. In this paper, we propose an end-to-end (E2E) Joint Unsupervised and Supervised Training (JUST) method to combine the supervised RNN-T loss and the self-supervised contrastive and masked language modeling (MLM) losses. We validate its performance on the public dataset Multilingual LibriSpeech (MLS), which includes 8 languages and is extremely imbalanced. On MLS, we explore (1) JUST trained from scratch, and (2) JUST finetuned from a pretrained checkpoint. Experiments show that JUST can consistently outperform other existing state-of-the-art methods, and beat the monolingual baseline by a significant margin, demonstrating JUST's capability of handling low-resource languages in multilingual ASR. Our average WER of all languages outperforms average monolingual baseline by 33.3%, and the state-of-the-art 2-stage XLSR by 32%. On low-resource languages like Polish, our WER is less than half of the monolingual baseline and even beats the supervised transfer learning method which uses external supervision. |
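Note | The abstract describes JUST as optimizing a single end-to-end objective that adds the supervised RNN-T loss to the self-supervised contrastive and MLM losses, rather than splitting them across a pretraining and a finetuning stage. As a minimal illustrative sketch only (the function name and the weights `w_c` and `w_m` are assumptions for illustration, not values from the paper), such a joint objective can be written as a weighted sum:

```python
# Illustrative sketch of a JUST-style joint objective (not the authors' code).
# Assumes the three per-batch losses are already computed by the model;
# the weights w_c and w_m are placeholders, not values from the paper.
def just_total_loss(rnnt_loss: float,
                    contrastive_loss: float,
                    mlm_loss: float,
                    w_c: float = 0.1,
                    w_m: float = 0.1) -> float:
    """Single scalar loss optimized end-to-end in one training stage."""
    return rnnt_loss + w_c * contrastive_loss + w_m * mlm_loss
```

In the 2-stage scheme described in the abstract, the contrastive and MLM terms would be optimized alone during pretraining and the RNN-T loss introduced only at finetuning; in joint training all three terms contribute gradients at every step.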
Copyright | 2021. This work is published under http://arxiv.org/licenses/nonexclusive-distrib/1.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License. |