Target sound information extraction: Speech and audio processing with neural networks conditioned on target clues

Bibliographic Details
Published in Acoustical Science and Technology, Vol. 46, No. 3, pp. 197–209
Main Authors Tawara, Naohiro; Sato, Hiroshi; Delcroix, Marc; Nakatani, Tomohiro; Araki, Shoko; Ashihara, Takanori; Moriya, Takafumi; Ochiai, Tsubasa
Format Journal Article
Language English
Published Tokyo: ACOUSTICAL SOCIETY OF JAPAN (一般社団法人 日本音響学会); Japan Science and Technology Agency, 01.05.2025
Subjects
Online Access Get full text
ISSN 1346-3969
EISSN 1347-5177
DOI 10.1250/ast.e24.124

Abstract This paper overviews neural target sound information extraction (TSIE), which consists of extracting the desired information about a sound source in an observed sound mixture given clues about the target source. TSIE is a general framework, which covers various applications, such as target speech/sound extraction (TSE), personalized voice activity detection (PVAD), target speaker automatic speech recognition (TS-ASR), etc. We formalize the ideas of TSIE and show how it can be implemented through various examples such as TSE, PVAD, and TS-ASR. We conclude the paper with a discussion of potential future research directions.
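To make the framework in the abstract concrete: TSIE conditions a neural network on a clue about the target source (for example, a speaker embedding obtained from enrollment audio) so that it extracts only the desired information from the observed mixture. The following is a minimal, hypothetical PyTorch sketch of that idea for the target speech extraction (TSE) case; it is not the authors' implementation, and the class name, layer sizes, and the FiLM-style fusion of clue and mixture features are illustrative assumptions.

```python
# Minimal, hypothetical sketch of clue-conditioned target speech extraction (TSE).
# Not the paper's implementation: module names, sizes, and the fusion scheme
# (FiLM-style scaling/shifting) are illustrative assumptions.
import torch
import torch.nn as nn


class ClueConditionedExtractor(nn.Module):
    def __init__(self, feat_dim: int = 257, clue_dim: int = 256, hidden: int = 256):
        super().__init__()
        # Encode the mixture spectrogram frame by frame.
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        # Map the clue embedding (e.g., a speaker embedding from enrollment audio)
        # to a per-channel scale and bias applied to the encoded mixture.
        self.film = nn.Linear(clue_dim, 2 * (2 * hidden))
        # Predict a time-frequency mask for the target source.
        self.mask_head = nn.Linear(2 * hidden, feat_dim)

    def forward(self, mixture_spec: torch.Tensor, clue_emb: torch.Tensor) -> torch.Tensor:
        # mixture_spec: (batch, frames, feat_dim) magnitude spectrogram of the mixture
        # clue_emb:     (batch, clue_dim) embedding of the target clue
        h, _ = self.encoder(mixture_spec)                   # (B, T, 2*hidden)
        scale, bias = self.film(clue_emb).chunk(2, dim=-1)  # each (B, 2*hidden)
        h = h * scale.unsqueeze(1) + bias.unsqueeze(1)      # condition on the clue
        mask = torch.sigmoid(self.mask_head(h))             # (B, T, feat_dim)
        return mask * mixture_spec                          # estimated target spectrogram


if __name__ == "__main__":
    model = ClueConditionedExtractor()
    mixture = torch.randn(2, 100, 257).abs()   # dummy mixture spectrograms
    speaker_clue = torch.randn(2, 256)         # dummy enrollment embeddings
    print(model(mixture, speaker_clue).shape)  # torch.Size([2, 100, 257])
```

The other applications listed in the abstract fit the same template: a PVAD variant would replace the mask head with a per-frame target-activity classifier, and a TS-ASR variant would feed the conditioned encoding to a speech recognition decoder instead of estimating a mask.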
ArticleNumber e24.124
Author Tawara, Naohiro; Sato, Hiroshi; Delcroix, Marc; Nakatani, Tomohiro; Araki, Shoko; Ashihara, Takanori; Moriya, Takafumi; Ochiai, Tsubasa
Author Organization NTT Communication Science Laboratories (all authors)
ContentType Journal Article
Copyright 2025 by The Acoustical Society of Japan
2025. This work is published under https://creativecommons.org/licenses/by-nd/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
DOI 10.1250/ast.e24.124
DatabaseName CiNii Complete
CrossRef
Electronics & Communications Abstracts
Linguistics and Language Behavior Abstracts (LLBA)
Solid State and Superconductivity Abstracts
Technology Research Database
Aerospace Database
Advanced Technologies Database with Aerospace
Discipline Physics
EISSN 1347-5177
EndPage 209
ISSN 1346-3969
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
Issue 3
Language English
License https://creativecommons.org/licenses/by-nd/4.0
OpenAccessLink https://www.jstage.jst.go.jp/article/ast/46/3/46_e24.124/_article/-char/en
PageCount 13
PublicationCentury 2000
PublicationDate 2025-05-01
PublicationDecade 2020
PublicationPlace Tokyo
PublicationTitle Acoustical Science and Technology
PublicationTitle_FL Acoust. Sci. & Tech
PublicationYear 2025
Publisher ACOUSTICAL SOCIETY OF JAPAN
一般社団法人 日本音響学会
Japan Science and Technology Agency
SecondaryResourceType review_article
StartPage 197
SubjectTerms Audio data
Audio processing
Automatic speech recognition
Information retrieval
Neural networks
Personalized voice activity detection
Sound sources
Speech processing
Speech recognition
Target detection
Target speaker automatic speech recognition
Target speech extraction
Voice activity detectors
Voice recognition
Title Target sound information extraction: Speech and audio processing with neural networks conditioned on target clues
URI https://www.jstage.jst.go.jp/article/ast/46/3/46_e24.124/_article/-char/en
https://cir.nii.ac.jp/crid/1390866215976202240
https://www.proquest.com/docview/3230040640
Volume 46