Jointly Recognizing Speech and Singing Voices Based on Multi-Task Audio Source Separation

Bibliographic Details
Published in	Proceedings (IEEE International Conference on Multimedia and Expo), pp. 1-6
Main Authors	Bai, Ye; Li, Chenxing; Li, Hao; Zhao, Yuanyuan; Wang, Xiaorui
Format	Conference Proceeding
Language	English
Published	IEEE 15.07.2024
Subjects	Accuracy; Benchmark testing; Complexity theory; lyrics recognition; multi-task audio source separation; Multitasking; Robustness; Source separation; Speech recognition
ISSN	1945-788X
DOI	10.1109/ICME57554.2024.10687477

Abstract	In short video and live broadcasts, speech, singing voice, and background music often overlap and obscure each other. This complexity creates difficulties in structuring and recognizing the audio content, which may impair subsequent ASR and music understanding applications. This paper proposes a multi-task audio source separation (MTASS) based ASR model called JRSV, which Jointly Recognizes Speech and singing Voices. Specifically, the MTASS module separates the mixed audio into distinct speech and singing voice tracks while removing background music. The CTC/attention hybrid recognition module recognizes both tracks. Online distillation is proposed to improve the robustness of recognition further. To evaluate the proposed methods, a benchmark dataset is constructed and released. Experimental results demonstrate that JRSV can significantly improve recognition accuracy on each track of the mixed audio.
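The abstract names a CTC/attention hybrid recognition module but gives no training objective. As a rough illustration only, hybrid CTC/attention models are commonly trained with a weighted interpolation of the two losses; the sketch below shows that interpolation applied to the two separated tracks. The names `TrackLosses`, `hybrid_loss`, and `joint_loss`, the default weight of 0.3, and the choice to sum the per-track losses are all assumptions made for this example and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class TrackLosses:
    """Losses computed on one separated track (hypothetical container)."""
    ctc: float  # CTC loss for this track
    att: float  # attention-decoder cross-entropy loss for this track


def hybrid_loss(losses: TrackLosses, ctc_weight: float = 0.3) -> float:
    """Interpolate CTC and attention losses: w * ctc + (1 - w) * att.

    The default ctc_weight of 0.3 is an assumed illustrative value.
    """
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return ctc_weight * losses.ctc + (1.0 - ctc_weight) * losses.att


def joint_loss(speech: TrackLosses, singing: TrackLosses,
               ctc_weight: float = 0.3) -> float:
    """Sum the hybrid loss over the speech and singing tracks.

    Summing the two tracks with equal weight is an assumption.
    """
    return hybrid_loss(speech, ctc_weight) + hybrid_loss(singing, ctc_weight)
```

In this sketch, `joint_loss(TrackLosses(2.0, 4.0), TrackLosses(1.0, 3.0), ctc_weight=0.5)` combines the speech-track and singing-track terms into a single scalar; the separation and distillation losses mentioned in the abstract are deliberately left out.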
Author Affiliations	Bai, Ye; Li, Chenxing; Li, Hao; Zhao, Yuanyuan; Wang, Xiaorui (all: Chinese Academy of Sciences, Institute of Automation, Beijing, China)
Discipline	Computer Science
EISBN	9798350390155
URI	https://ieeexplore.ieee.org/document/10687477