Two-Stage Cross-Modal Speech Separation Based on Speech Embeddings
In recent years, speech separation leveraging visual information has made significant progress. To sharpen the model's focus on the target speaker, several studies have incorporated additional acoustic features on top of audio-visual fusion, but current methods require extra information about the speaker to extract such features. To address this limitation, we propose a two-stage cross-modal speech separation model whose speech-embedding extraction module avoids the need for pre-enrolled voice samples and speaker IDs, thereby broadening the method's applicability. In addition, the proposed multi-level feature fusion fully exploits features at different levels of the model to avoid information loss. We conduct a series of experiments on the public VoxCeleb2 dataset; compared with other approaches, our method demonstrates superior performance on the key speech separation metrics.
Published in: Biomedical Circuits and Systems Conference, pp. 1-5
Format: Conference Proceeding
Language: English
Published: IEEE, 24.10.2024
ISSN: 2766-4465
DOI: 10.1109/BioCAS61083.2024.10798276
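The abstract describes the architecture only at a high level, so the following is a hypothetical sketch of the multi-level feature fusion idea, not the authors' implementation: intermediate features from several depths of a separation network are projected to a common width and softly combined so that no level's information is discarded. All module names, feature widths, and the weighted-sum fusion strategy here are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class MultiLevelFusion(nn.Module):
    """Hypothetical multi-level feature fusion: project intermediate
    features from several network depths to a common width, then
    combine them with learned weights so information from every
    level is retained. Illustrative sketch only, not the paper's code."""

    def __init__(self, level_dims, fused_dim):
        super().__init__()
        # One 1x1 projection per feature level (assumed design choice).
        self.projs = nn.ModuleList(
            nn.Conv1d(d, fused_dim, kernel_size=1) for d in level_dims
        )
        # Learned per-level weights for a soft combination.
        self.weights = nn.Parameter(torch.ones(len(level_dims)))

    def forward(self, levels):
        # levels: list of tensors, each of shape (batch, C_i, time).
        w = torch.softmax(self.weights, dim=0)
        fused = sum(
            w[i] * proj(x)
            for i, (proj, x) in enumerate(zip(self.projs, levels))
        )
        return fused  # (batch, fused_dim, time)


if __name__ == "__main__":
    # Toy usage with made-up feature widths from three network depths.
    fusion = MultiLevelFusion(level_dims=[64, 128, 256], fused_dim=128)
    feats = [torch.randn(2, c, 100) for c in (64, 128, 256)]
    out = fusion(feats)
    print(out.shape)  # torch.Size([2, 128, 100])
```

A learned softmax weighting is one simple way to let the network decide how much each level contributes; concatenating the projected levels and mixing them with a 1x1 convolution would be an equally plausible reading of "multi-level feature fusion."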