Transformer Based Unsupervised Pre-Training for Acoustic Representation Learning

Bibliographic Details
Published in Proceedings of the ... IEEE International Conference on Acoustics, Speech and Signal Processing (1998), pp. 6933-6937
Main Authors Zhang, Ruixiong, Wu, Haiwei, Li, Wubo, Jiang, Dongwei, Zou, Wei, Li, Xiangang
Format Conference Proceeding
Language English
Published IEEE 01.01.2021
Abstract Recently, a variety of acoustic tasks and related applications have emerged. For many acoustic tasks, the amount of labeled data may be limited. To address this problem, we propose an unsupervised pre-training method that uses a Transformer-based encoder to learn a general and robust high-level representation for all acoustic tasks. Experiments were conducted on three kinds of acoustic tasks: speech emotion recognition, sound event detection, and speech translation. All of the experiments show that pre-training on a task's own training data can significantly improve performance. With a larger pre-training corpus combining the MuST-C, Librispeech, and ESC-US datasets, the UAR for speech emotion recognition further improves by an absolute 4.3% on the IEMOCAP dataset; the F1 score for sound event detection further improves by an absolute 1.5% on the DCASE2018 Task 5 development set and 2.1% on its evaluation set; and the BLEU score for speech translation further improves by a relative 12.2% on the En-De dataset and 8.4% on the En-Fr dataset.
Affiliations DiDi Chuxing, Beijing, China (all six authors)
DOI 10.1109/ICASSP39728.2021.9414996
Discipline Engineering
EISBN 9781728176055
1728176050
EISSN 2379-190X
Genre orig-research
PageCount 5
PublicationTitleAbbrev ICASSP
SubjectTerms acoustic representation learning
Acoustics
Conferences
Emotion recognition
Event detection
Signal processing
Speech recognition
Training data
Transformer
unsupervised pre-training
URI https://ieeexplore.ieee.org/document/9414996