Transformer Based Unsupervised Pre-Training for Acoustic Representation Learning
Recently, a variety of acoustic tasks and related applications have emerged. For many acoustic tasks, the amount of labeled data may be limited. To address this problem, we propose an unsupervised pre-training method using a Transformer-based encoder to learn a general and robust high-level representation for all...
Published in | Proceedings of the ... IEEE International Conference on Acoustics, Speech and Signal Processing (1998) pp. 6933 - 6937 |
Main Authors | Zhang, Ruixiong; Wu, Haiwei; Li, Wubo; Jiang, Dongwei; Zou, Wei; Li, Xiangang |
Format | Conference Proceeding |
Language | English |
Published | IEEE, 01.01.2021 |
Subjects | acoustic representation learning; Acoustics; Emotion recognition; Event detection; Signal processing; Speech recognition; Training data; Transformer; unsupervised pre-training |
Abstract | Recently, a variety of acoustic tasks and related applications have emerged. For many acoustic tasks, the amount of labeled data may be limited. To address this problem, we propose an unsupervised pre-training method using a Transformer-based encoder to learn a general and robust high-level representation for all acoustic tasks. Experiments have been conducted on three kinds of acoustic tasks: speech emotion recognition, sound event detection, and speech translation. All the experiments show that pre-training on a task's own training data can significantly improve performance. With larger pre-training data combining the MuST-C, Librispeech, and ESC-US datasets, the UAR for speech emotion recognition further improves by an absolute 4.3% on the IEMOCAP dataset. For sound event detection, the F1 score further improves by an absolute 1.5% on the DCASE2018 task 5 development set and by 2.1% on the evaluation set. For speech translation, the BLEU score further improves by a relative 12.2% on the En-De dataset and by 8.4% on the En-Fr dataset. |
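The record describes the method only at a high level: a Transformer-based encoder is pre-trained without labels on acoustic features, and its learned representation is then used for downstream tasks. The exact pre-training objective is not given in this record, so the sketch below assumes a masked-reconstruction loss over log-mel frames, a common choice for this family of methods; all names (MaskedAcousticModel, mask_frames, FEATURE_DIM) and hyperparameters are illustrative and not taken from the paper.

```python
# Minimal sketch of Transformer-based unsupervised acoustic pre-training.
# Assumption: masked frames are reconstructed from context; positional
# encodings and data loading are omitted for brevity.
import torch
import torch.nn as nn

FEATURE_DIM = 80   # log-mel filterbank dimension (assumed)
MODEL_DIM = 256
MASK_PROB = 0.15   # fraction of frames to mask (assumed)


class MaskedAcousticModel(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.input_proj = nn.Linear(FEATURE_DIM, MODEL_DIM)
        layer = nn.TransformerEncoderLayer(
            d_model=MODEL_DIM, nhead=4, dim_feedforward=1024, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        self.reconstruct = nn.Linear(MODEL_DIM, FEATURE_DIM)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, FEATURE_DIM) -> reconstructed frames, same shape
        hidden = self.encoder(self.input_proj(feats))
        return self.reconstruct(hidden)


def mask_frames(feats: torch.Tensor):
    """Zero out a random subset of frames; return masked input and the mask."""
    mask = torch.rand(feats.shape[:2], device=feats.device) < MASK_PROB
    masked = feats.masked_fill(mask.unsqueeze(-1), 0.0)
    return masked, mask


def pretrain_step(model: nn.Module, feats: torch.Tensor,
                  optimizer: torch.optim.Optimizer) -> float:
    """One unsupervised step: reconstruct masked frames, L1 loss on them only."""
    masked, mask = mask_frames(feats)
    pred = model(masked)
    loss = (pred - feats).abs()[mask].mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    model = MaskedAcousticModel()
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    dummy = torch.randn(8, 200, FEATURE_DIM)  # batch of unlabeled utterances
    print(pretrain_step(model, dummy, opt))
```

After pre-training on unlabeled audio (e.g. the MuST-C, Librispeech, and ESC-US combination mentioned in the abstract), the encoder's hidden states would serve as the high-level representation that is fine-tuned for speech emotion recognition, sound event detection, or speech translation.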
Author | Wu, Haiwei; Zhang, Ruixiong; Li, Xiangang; Jiang, Dongwei; Li, Wubo; Zou, Wei |
Author_xml | – sequence: 1 givenname: Ruixiong surname: Zhang fullname: Zhang, Ruixiong organization: DiDi Chuxing,Beijing,China – sequence: 2 givenname: Haiwei surname: Wu fullname: Wu, Haiwei organization: DiDi Chuxing,Beijing,China – sequence: 3 givenname: Wubo surname: Li fullname: Li, Wubo organization: DiDi Chuxing,Beijing,China – sequence: 4 givenname: Dongwei surname: Jiang fullname: Jiang, Dongwei organization: DiDi Chuxing,Beijing,China – sequence: 5 givenname: Wei surname: Zou fullname: Zou, Wei organization: DiDi Chuxing,Beijing,China – sequence: 6 givenname: Xiangang surname: Li fullname: Li, Xiangang organization: DiDi Chuxing,Beijing,China |
ContentType | Conference Proceeding |
DOI | 10.1109/ICASSP39728.2021.9414996 |
DatabaseName | IEEE Electronic Library (IEL) Conference Proceedings IEEE Proceedings Order Plan (POP) 1998-present by volume IEEE Xplore All Conference Proceedings IEEE Electronic Library (IEL) IEEE Proceedings Order Plans (POP) 1998-present |
Discipline | Engineering |
EISBN | 9781728176055 1728176050 |
EISSN | 2379-190X |
EndPage | 6937 |
ExternalDocumentID | 9414996 |
Genre | orig-research |
IsPeerReviewed | false |
IsScholarly | true |
Language | English |
PageCount | 5 |
PublicationCentury | 2000 |
PublicationDate | 2021-01-01 |
PublicationDateYYYYMMDD | 2021-01-01 |
PublicationDate_xml | – month: 01 year: 2021 text: 2021-01-01 day: 01 |
PublicationDecade | 2020 |
PublicationTitle | Proceedings of the ... IEEE International Conference on Acoustics, Speech and Signal Processing (1998) |
PublicationTitleAbbrev | ICASSP |
PublicationYear | 2021 |
Publisher | IEEE |
Publisher_xml | – name: IEEE |
SSID | ssj0008748 |
SourceID | ieee |
SourceType | Publisher |
StartPage | 6933 |
SubjectTerms | acoustic representation learning; Acoustics; Conferences; Emotion recognition; Event detection; Signal processing; Speech recognition; Training data; Transformer; unsupervised pre-training |
Title | Transformer Based Unsupervised Pre-Training for Acoustic Representation Learning |
URI | https://ieeexplore.ieee.org/document/9414996 |