Target sound information extraction: Speech and audio processing with neural networks conditioned on target clues

Bibliographic Details
Published in Acoustical Science and Technology, Vol. 46, No. 3, pp. 197–209
Main Authors Tawara, Naohiro; Sato, Hiroshi; Delcroix, Marc; Nakatani, Tomohiro; Araki, Shoko; Ashihara, Takanori; Moriya, Takafumi; Ochiai, Tsubasa
Format Journal Article
Language English
Published Tokyo: ACOUSTICAL SOCIETY OF JAPAN (一般社団法人 日本音響学会); Japan Science and Technology Agency, 01.05.2025
Subjects
Online Access Get full text
ISSN 1346-3969
EISSN 1347-5177
DOI 10.1250/ast.e24.124

Abstract This paper overviews neural target sound information extraction (TSIE), which consists of extracting the desired information about a sound source in an observed sound mixture given clues about the target source. TSIE is a general framework, which covers various applications, such as target speech/sound extraction (TSE), personalized voice activity detection (PVAD), target speaker automatic speech recognition (TS-ASR), etc. We formalize the ideas of TSIE and show how it can be implemented through various examples such as TSE, PVAD, and TS-ASR. We conclude the paper with a discussion of potential future research directions.
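To make the framework in the abstract concrete: TSIE conditions a neural network on a clue about the target source (for example, a speaker embedding obtained from enrollment audio) so that it extracts only the desired information from the observed mixture. The following is a minimal, hypothetical PyTorch sketch of that idea for the target speech extraction (TSE) case; it is not the authors' implementation, and the class name, layer sizes, and the FiLM-style fusion of clue and mixture features are illustrative assumptions.

```python
# Minimal, hypothetical sketch of clue-conditioned target speech extraction (TSE).
# Not the paper's implementation: module names, sizes, and the fusion scheme
# (FiLM-style scaling/shifting) are illustrative assumptions.
import torch
import torch.nn as nn


class ClueConditionedExtractor(nn.Module):
    def __init__(self, feat_dim: int = 257, clue_dim: int = 256, hidden: int = 256):
        super().__init__()
        # Encode the mixture spectrogram frame by frame.
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        # Map the clue embedding (e.g., a speaker embedding from enrollment audio)
        # to a per-channel scale and bias applied to the encoded mixture.
        self.film = nn.Linear(clue_dim, 2 * (2 * hidden))
        # Predict a time-frequency mask for the target source.
        self.mask_head = nn.Linear(2 * hidden, feat_dim)

    def forward(self, mixture_spec: torch.Tensor, clue_emb: torch.Tensor) -> torch.Tensor:
        # mixture_spec: (batch, frames, feat_dim) magnitude spectrogram of the mixture
        # clue_emb:     (batch, clue_dim) embedding of the target clue
        h, _ = self.encoder(mixture_spec)                   # (B, T, 2*hidden)
        scale, bias = self.film(clue_emb).chunk(2, dim=-1)  # each (B, 2*hidden)
        h = h * scale.unsqueeze(1) + bias.unsqueeze(1)      # condition on the clue
        mask = torch.sigmoid(self.mask_head(h))             # (B, T, feat_dim)
        return mask * mixture_spec                          # estimated target spectrogram


if __name__ == "__main__":
    model = ClueConditionedExtractor()
    mixture = torch.randn(2, 100, 257).abs()   # dummy mixture spectrograms
    speaker_clue = torch.randn(2, 256)         # dummy enrollment embeddings
    print(model(mixture, speaker_clue).shape)  # torch.Size([2, 100, 257])
```

The other applications listed in the abstract fit the same template: a PVAD variant would replace the mask head with a per-frame target-activity classifier, and a TS-ASR variant would feed the conditioned encoding to a speech recognition decoder instead of estimating a mask.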
ArticleNumber e24.124
Author Tawara, Naohiro; Sato, Hiroshi; Delcroix, Marc; Nakatani, Tomohiro; Araki, Shoko; Ashihara, Takanori; Moriya, Takafumi; Ochiai, Tsubasa
Author Organization NTT Communication Science Laboratories (all authors)
ContentType Journal Article
Copyright 2025 by The Acoustical Society of Japan
2025. This work is published under https://creativecommons.org/licenses/by-nd/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
DOI 10.1250/ast.e24.124
DatabaseName CiNii Complete
CrossRef
Electronics & Communications Abstracts
Linguistics and Language Behavior Abstracts (LLBA)
Solid State and Superconductivity Abstracts
Technology Research Database
Aerospace Database
Advanced Technologies Database with Aerospace
Discipline Physics
EISSN 1347-5177
EndPage 209
ISSN 1346-3969
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
Issue 3
Language English
License https://creativecommons.org/licenses/by-nd/4.0
OpenAccessLink https://www.jstage.jst.go.jp/article/ast/46/3/46_e24.124/_article/-char/en
PageCount 13
PublicationCentury 2000
PublicationDate 2025-05-01
PublicationDecade 2020
PublicationPlace Tokyo
PublicationTitle Acoustical Science and Technology
PublicationTitle_FL Acoust. Sci. & Tech
PublicationYear 2025
Publisher ACOUSTICAL SOCIETY OF JAPAN
一般社団法人 日本音響学会
Japan Science and Technology Agency
SecondaryResourceType review_article
StartPage 197
SubjectTerms Audio data
Audio processing
Automatic speech recognition
Information retrieval
Neural networks
Personalized voice activity detection
Sound sources
Speech processing
Speech recognition
Target detection
Target speaker automatic speech recognition
Target speech extraction
Voice activity detectors
Voice recognition
Title Target sound information extraction: Speech and audio processing with neural networks conditioned on target clues
URI https://www.jstage.jst.go.jp/article/ast/46/3/46_e24.124/_article/-char/en
https://cir.nii.ac.jp/crid/1390866215976202240
https://www.proquest.com/docview/3230040640
Volume 46