L-SpEx: Localized Target Speaker Extraction


Bibliographic Details
Published in	arXiv.org
Main Authors	Ge, Meng; Xu, Chenglin; Wang, Longbiao; Chng, Eng Siong; Dang, Jianwu; Li, Haizhou
Format	Paper
Language	English
Published	Ithaca: Cornell University Library, arXiv.org, 21.02.2022
Subjects	Audio equipment; Beamforming; Cues; Direction of arrival; Embedding; Feature extraction; Speech recognition
Summary:	Speaker extraction aims to extract the target speaker's voice from a multi-talker speech mixture given an auxiliary reference utterance. Recent studies show that speaker extraction benefits from the location or direction of the target speaker. However, these studies assume that the target speaker's location is known in advance or is detected by an extra visual cue, e.g., a face image or video. In this paper, we propose L-SpEx, an end-to-end localized target speaker extraction approach that relies purely on speech cues. Specifically, we design a speaker localizer driven by the target speaker's embedding to extract spatial features, including the direction of arrival (DOA) of the target speaker and the beamforming output. The spatial cues and the target speaker's embedding are then both used to form top-down auditory attention to the target speaker. Experiments on the multi-channel reverberant dataset MC-Libri2Mix show that L-SpEx significantly outperforms the baseline system.
ISSN:	2331-8422