Two-Stage Cross-Modal Speech Separation Based on Speech Embeddings

Bibliographic Details
Published in: Biomedical Circuits and Systems Conference (BioCAS), pp. 1-5
Main Authors: Deng, Yuanjie; Liu, Yinggang; Wei, Ying
Format: Conference Proceeding
Language: English
Published: IEEE, 24.10.2024
ISSN: 2766-4465
DOI: 10.1109/BioCAS61083.2024.10798276

Summary: In recent years, speech separation leveraging visual information has made significant progress. To sharpen a model's focus on the target speaker, several studies have incorporated additional acoustic features through audio-visual fusion, but current methods require extra information about the speaker to extract those features. To address this limitation, we propose a two-stage cross-modal speech separation model whose speech-embedding extraction module avoids the need for pre-enrolled voice samples and speaker IDs, thereby broadening the method's applicability. In addition, the proposed multi-level feature fusion fully exploits features at different levels of the model to avoid information loss. We conduct a series of experiments on the public VoxCeleb2 dataset. Compared with other approaches, our method demonstrates superior performance on the key metrics for evaluating speech separation.
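The record gives only this high-level summary, so the PyTorch sketch below is purely illustrative: it shows one plausible way to organize a two-stage pipeline in which the speaker embedding is derived from the stage-1 estimate itself (so no enrolled utterance or speaker ID is needed) and stage 2 fuses features from several depths before refining the estimate. Every module name, dimension, and the concat-and-project fusion scheme are assumptions, not the authors' architecture.

# Hypothetical illustration only: the paper's actual architecture is not
# described in this record. Names, dimensions, and the fusion scheme are
# assumptions used to sketch the two-stage idea.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Stage1Separator(nn.Module):
    """Stage 1 (assumed): coarse audio-visual separation that fuses lip
    features with an encoded audio mixture and predicts a mask."""
    def __init__(self, feat=256, lip_dim=512):
        super().__init__()
        self.audio_enc = nn.Conv1d(1, feat, kernel_size=16, stride=8)
        self.video_proj = nn.Linear(lip_dim, feat)
        self.mask_net = nn.Sequential(
            nn.Conv1d(2 * feat, feat, 1), nn.ReLU(),
            nn.Conv1d(feat, feat, 1), nn.Sigmoid())
        self.decoder = nn.ConvTranspose1d(feat, 1, kernel_size=16, stride=8)

    def forward(self, mix, lips):
        a = self.audio_enc(mix)                        # (B, feat, T')
        v = self.video_proj(lips).transpose(1, 2)      # (B, feat, Tv)
        v = F.interpolate(v, size=a.shape[-1])         # align time resolution
        masked = a * self.mask_net(torch.cat([a, v], dim=1))
        return self.decoder(masked), masked            # waveform + features

class SpeechEmbedder(nn.Module):
    """Derives a speaker embedding from the stage-1 estimate itself, so no
    pre-enrolled voice or speaker ID is required (the abstract's claim)."""
    def __init__(self, feat=256, emb=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(feat, feat, 3, padding=1), nn.ReLU(),
            nn.Conv1d(feat, emb, 3, padding=1))

    def forward(self, feats):                 # feats: (B, feat, T')
        return self.net(feats).mean(dim=-1)   # temporal pooling -> (B, emb)

class Stage2Refiner(nn.Module):
    """Stage 2 (assumed): refinement conditioned on the speech embedding,
    with features from several depths concatenated and re-projected as a
    simple stand-in for multi-level feature fusion."""
    def __init__(self, feat=256, emb=128, levels=3):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Conv1d(feat, feat, 3, padding=1) for _ in range(levels))
        self.emb_proj = nn.Linear(emb, feat)
        self.fuse = nn.Conv1d(levels * feat, feat, 1)
        self.mask = nn.Sequential(nn.Conv1d(feat, feat, 1), nn.Sigmoid())

    def forward(self, feats, spk_emb):
        cond = self.emb_proj(spk_emb).unsqueeze(-1)    # (B, feat, 1)
        x, level_outs = feats, []
        for blk in self.blocks:                        # condition every level
            x = torch.relu(blk(x + cond))
            level_outs.append(x)
        fused = self.fuse(torch.cat(level_outs, dim=1))
        return feats * self.mask(fused)

if __name__ == "__main__":
    mix = torch.randn(2, 1, 8000)       # 0.5 s of 16 kHz mixture, batch of 2
    lips = torch.randn(2, 25, 512)      # 25 frames of lip features (assumed)
    stage1, embedder, stage2 = Stage1Separator(), SpeechEmbedder(), Stage2Refiner()
    est1, feats = stage1(mix, lips)     # coarse stage-1 estimate
    refined = stage2(feats, embedder(feats))
    est2 = stage1.decoder(refined)      # final waveform via the shared decoder
    print(est1.shape, est2.shape)       # torch.Size([2, 1, 8000]) twice

The key design point mirrored from the abstract is that the embedding is computed from the stage-1 output rather than from an enrollment utterance, which is what removes the dependence on prior speaker information.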