Jointly Recognizing Speech and Singing Voices Based on Multi-Task Audio Source Separation

Bibliographic Details
Published in Proceedings (IEEE International Conference on Multimedia and Expo), pp. 1-6
Main Authors Bai, Ye; Li, Chenxing; Li, Hao; Zhao, Yuanyuan; Wang, Xiaorui
Format Conference Proceeding
Language English
Published IEEE 15.07.2024
ISSN 1945-788X
DOI 10.1109/ICME57554.2024.10687477

Abstract In short videos and live broadcasts, speech, singing voice, and background music often overlap and obscure one another. This overlap makes the audio content hard to structure and recognize, which may impair downstream ASR and music understanding applications. This paper proposes JRSV, an ASR model based on multi-task audio source separation (MTASS) that Jointly Recognizes Speech and singing Voices. Specifically, the MTASS module separates the mixed audio into distinct speech and singing voice tracks while removing background music, and a CTC/attention hybrid recognition module then recognizes both tracks. Online distillation is proposed to further improve recognition robustness. To evaluate the proposed methods, a benchmark dataset is constructed and released. Experimental results demonstrate that JRSV can significantly improve recognition accuracy on each track of the mixed audio.
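The data flow described in the abstract (separate the mixture, then recognize each track) can be sketched as follows. This is a minimal illustration only, not the authors' implementation: `SeparatedTracks`, `recognize_mixture`, `hybrid_loss`, and the `ctc_weight` default are hypothetical names and values chosen for this example. The loss interpolation shown is the commonly used hybrid CTC/attention training objective; the record does not confirm that JRSV uses exactly this form.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Audio = List[float]  # placeholder waveform type, for illustration only


@dataclass
class SeparatedTracks:
    """Hypothetical container for the two tracks an MTASS-style module outputs."""
    speech: Audio
    singing: Audio  # background music is assumed to be discarded by the separator


def hybrid_loss(ctc_loss: float, att_loss: float, ctc_weight: float = 0.3) -> float:
    """Interpolate CTC and attention losses (common hybrid objective).

    ctc_weight = 0.3 is an assumed example value, not taken from the paper.
    """
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * att_loss


def recognize_mixture(
    mixture: Audio,
    separate: Callable[[Audio], SeparatedTracks],
    recognize: Callable[[Audio], str],
) -> Dict[str, str]:
    """Separate the mixture first, then run one recognizer on each track."""
    tracks = separate(mixture)
    return {
        "speech": recognize(tracks.speech),
        "singing": recognize(tracks.singing),
    }
```

Any separator and recognizer with matching signatures can be plugged in. Passing them as callables keeps the sketch independent of a particular model or toolkit.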
Author_xml – sequence: 1
  givenname: Ye
  surname: Bai
  fullname: Bai, Ye
  organization: Chinese Academy of Sciences, Institute of Automation, Beijing, China
– sequence: 2
  givenname: Chenxing
  surname: Li
  fullname: Li, Chenxing
  organization: Chinese Academy of Sciences, Institute of Automation, Beijing, China
– sequence: 3
  givenname: Hao
  surname: Li
  fullname: Li, Hao
  organization: Chinese Academy of Sciences, Institute of Automation, Beijing, China
– sequence: 4
  givenname: Yuanyuan
  surname: Zhao
  fullname: Zhao, Yuanyuan
  organization: Chinese Academy of Sciences, Institute of Automation, Beijing, China
– sequence: 5
  givenname: Xiaorui
  surname: Wang
  fullname: Wang, Xiaorui
  organization: Chinese Academy of Sciences, Institute of Automation, Beijing, China
Discipline Computer Science
EISBN 9798350390155
EISSN 1945-788X
IsPeerReviewed false
IsScholarly false
PublicationTitleAbbrev ICME
SubjectTerms Accuracy
Benchmark testing
Complexity theory
lyrics recognition
multi-task audio source separation
Multitasking
Robustness
Source separation
Speech recognition
URI https://ieeexplore.ieee.org/document/10687477