Two-Stage Cross-Modal Speech Separation Based on Speech Embeddings

Bibliographic Details
Published in: Biomedical Circuits and Systems Conference (BioCAS), pp. 1-5
Main Authors: Deng, Yuanjie; Liu, Yinggang; Wei, Ying
Format: Conference Proceeding
Language: English
Published: IEEE, 24.10.2024
ISSN: 2766-4465
DOI: 10.1109/BioCAS61083.2024.10798276

Summary: In recent years, speech separation leveraging visual information has made significant progress. To sharpen a model's focus on the target speaker, several studies have incorporated additional acoustic features through audio-visual fusion, but current methods require extra information about the speaker to extract those features. To address this limitation, we propose a two-stage cross-modal speech separation model whose speech-embedding extraction module avoids the need for pre-enrolled voice samples and speaker IDs, thereby broadening the method's applicability. In addition, the proposed multi-level feature fusion fully exploits features at different levels of the model to avoid information loss. We conduct a series of experiments on the public VoxCeleb2 dataset. Compared with other approaches, our method demonstrates superior performance on the key metrics for evaluating speech separation.
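The record gives only this high-level summary, so the PyTorch sketch below is purely illustrative: it shows one plausible way to organize a two-stage pipeline in which the speaker embedding is derived from the stage-1 estimate itself (so no enrolled utterance or speaker ID is needed) and stage 2 fuses features from several depths before refining the estimate. Every module name, dimension, and the concat-and-project fusion scheme are assumptions, not the authors' architecture.

# Hypothetical illustration only: the paper's actual architecture is not
# described in this record. Names, dimensions, and the fusion scheme are
# assumptions used to sketch the two-stage idea.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Stage1Separator(nn.Module):
    """Stage 1 (assumed): coarse audio-visual separation that fuses lip
    features with an encoded audio mixture and predicts a mask."""
    def __init__(self, feat=256, lip_dim=512):
        super().__init__()
        self.audio_enc = nn.Conv1d(1, feat, kernel_size=16, stride=8)
        self.video_proj = nn.Linear(lip_dim, feat)
        self.mask_net = nn.Sequential(
            nn.Conv1d(2 * feat, feat, 1), nn.ReLU(),
            nn.Conv1d(feat, feat, 1), nn.Sigmoid())
        self.decoder = nn.ConvTranspose1d(feat, 1, kernel_size=16, stride=8)

    def forward(self, mix, lips):
        a = self.audio_enc(mix)                        # (B, feat, T')
        v = self.video_proj(lips).transpose(1, 2)      # (B, feat, Tv)
        v = F.interpolate(v, size=a.shape[-1])         # align time resolution
        masked = a * self.mask_net(torch.cat([a, v], dim=1))
        return self.decoder(masked), masked            # waveform + features

class SpeechEmbedder(nn.Module):
    """Derives a speaker embedding from the stage-1 estimate itself, so no
    pre-enrolled voice or speaker ID is required (the abstract's claim)."""
    def __init__(self, feat=256, emb=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(feat, feat, 3, padding=1), nn.ReLU(),
            nn.Conv1d(feat, emb, 3, padding=1))

    def forward(self, feats):                 # feats: (B, feat, T')
        return self.net(feats).mean(dim=-1)   # temporal pooling -> (B, emb)

class Stage2Refiner(nn.Module):
    """Stage 2 (assumed): refinement conditioned on the speech embedding,
    with features from several depths concatenated and re-projected as a
    simple stand-in for multi-level feature fusion."""
    def __init__(self, feat=256, emb=128, levels=3):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Conv1d(feat, feat, 3, padding=1) for _ in range(levels))
        self.emb_proj = nn.Linear(emb, feat)
        self.fuse = nn.Conv1d(levels * feat, feat, 1)
        self.mask = nn.Sequential(nn.Conv1d(feat, feat, 1), nn.Sigmoid())

    def forward(self, feats, spk_emb):
        cond = self.emb_proj(spk_emb).unsqueeze(-1)    # (B, feat, 1)
        x, level_outs = feats, []
        for blk in self.blocks:                        # condition every level
            x = torch.relu(blk(x + cond))
            level_outs.append(x)
        fused = self.fuse(torch.cat(level_outs, dim=1))
        return feats * self.mask(fused)

if __name__ == "__main__":
    mix = torch.randn(2, 1, 8000)       # 0.5 s of 16 kHz mixture, batch of 2
    lips = torch.randn(2, 25, 512)      # 25 frames of lip features (assumed)
    stage1, embedder, stage2 = Stage1Separator(), SpeechEmbedder(), Stage2Refiner()
    est1, feats = stage1(mix, lips)     # coarse stage-1 estimate
    refined = stage2(feats, embedder(feats))
    est2 = stage1.decoder(refined)      # final waveform via the shared decoder
    print(est1.shape, est2.shape)       # torch.Size([2, 1, 8000]) twice

The key design point mirrored from the abstract is that the embedding is computed from the stage-1 output rather than from an enrollment utterance, which is what removes the dependence on prior speaker information.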