Jointly Recognizing Speech and Singing Voices Based on Multi-Task Audio Source Separation
Published in | Proceedings (IEEE International Conference on Multimedia and Expo) pp. 1 - 6 |
---|---|
Main Authors | Bai, Ye; Li, Chenxing; Li, Hao; Zhao, Yuanyuan; Wang, Xiaorui |
Format | Conference Proceeding |
Language | English |
Published | IEEE 15.07.2024 |
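Subjects | Accuracy; Benchmark testing; Complexity theory; lyrics recognition; multi-task audio source separation; Multitasking; Robustness; Source separation; Speech recognition |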
ISSN | 1945-788X |
DOI | 10.1109/ICME57554.2024.10687477 |
Abstract | In short video and live broadcasts, speech, singing voice, and background music often overlap and obscure each other. This complexity creates difficulties in structuring and recognizing the audio content, which may impair subsequent ASR and music understanding applications. This paper proposes a multi-task audio source separation (MTASS) based ASR model called JRSV, which Jointly Recognizes Speech and singing Voices. Specifically, the MTASS module separates the mixed audio into distinct speech and singing voice tracks while removing background music. The CTC/attention hybrid recognition module recognizes both tracks. Online distillation is proposed to improve the robustness of recognition further. To evaluate the proposed methods, a benchmark dataset is constructed and released. Experimental results demonstrate that JRSV can significantly improve recognition accuracy on each track of the mixed audio. |
---|---|
Author | Zhao, Yuanyuan; Wang, Xiaorui; Bai, Ye; Li, Chenxing; Li, Hao |
Author_xml | – sequence: 1 givenname: Ye surname: Bai fullname: Bai, Ye organization: Chinese Academy of Sciences,Institute of Automation,Beijing,China – sequence: 2 givenname: Chenxing surname: Li fullname: Li, Chenxing organization: Chinese Academy of Sciences,Institute of Automation,Beijing,China – sequence: 3 givenname: Hao surname: Li fullname: Li, Hao organization: Chinese Academy of Sciences,Institute of Automation,Beijing,China – sequence: 4 givenname: Yuanyuan surname: Zhao fullname: Zhao, Yuanyuan organization: Chinese Academy of Sciences,Institute of Automation,Beijing,China – sequence: 5 givenname: Xiaorui surname: Wang fullname: Wang, Xiaorui organization: Chinese Academy of Sciences,Institute of Automation,Beijing,China |
ContentType | Conference Proceeding |
DBID | 6IE 6IL CBEJK RIE RIL |
DOI | 10.1109/ICME57554.2024.10687477 |
DatabaseName | IEEE Electronic Library (IEL) Conference Proceedings IEEE Xplore POP ALL IEEE Xplore All Conference Proceedings IEEE Electronic Library (IEL) IEEE Proceedings Order Plans (POP All) 1998-Present |
DatabaseTitleList | |
Database_xml | – sequence: 1 dbid: RIE name: IEEE Xplore url: https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/ sourceTypes: Publisher |
DeliveryMethod | fulltext_linktorsrc |
Discipline | Computer Science |
EISBN | 9798350390155 |
EISSN | 1945-788X |
EndPage | 6 |
ExternalDocumentID | 10687477 |
Genre | orig-research |
GroupedDBID | 6IE 6IF 6IK 6IL 6IN AAJGR AAWTH ADZIZ ALMA_UNASSIGNED_HOLDINGS BEFXN BFFAM BGNUA BKEBE BPEOZ CBEJK CHZPO IPLJI OCL RIE RIL RNS |
ID | FETCH-LOGICAL-i176t-9828c2f580905b24aec0e5eb8778e1873b311e386c0376072a897efd90c4eb253 |
IEDL.DBID | RIE |
IngestDate | Wed Aug 27 02:20:32 EDT 2025 |
IsPeerReviewed | false |
IsScholarly | false |
Language | English |
LinkModel | DirectLink |
MergedId | FETCHMERGED-LOGICAL-i176t-9828c2f580905b24aec0e5eb8778e1873b311e386c0376072a897efd90c4eb253 |
PageCount | 6 |
ParticipantIDs | ieee_primary_10687477 |
PublicationCentury | 2000 |
PublicationDate | 2024-July-15 |
PublicationDateYYYYMMDD | 2024-07-15 |
PublicationDate_xml | – month: 07 year: 2024 text: 2024-July-15 day: 15 |
PublicationDecade | 2020 |
PublicationTitle | Proceedings (IEEE International Conference on Multimedia and Expo) |
PublicationTitleAbbrev | ICME |
PublicationYear | 2024 |
Publisher | IEEE |
Publisher_xml | – name: IEEE |
SSID | ssj0000744903 |
Score | 1.8781964 |
SourceID | ieee |
SourceType | Publisher |
StartPage | 1 |
SubjectTerms | Accuracy; Benchmark testing; Complexity theory; lyrics recognition; multi-task audio source separation; Multitasking; Robustness; Source separation; Speech recognition |
Title | Jointly Recognizing Speech and Singing Voices Based on Multi-Task Audio Source Separation |
URI | https://ieeexplore.ieee.org/document/10687477 |
hasFullText | 1 |
inHoldings | 1 |
isFullTextHit | |
isPrint | |
linkProvider | IEEE |