Noise-Aware Extended U-Net With Split Encoder and Feature Refinement Module for Robust Speaker Verification in Noisy Environments

Bibliographic Details
Published in	IEEE Access, Vol. 12, pp. 111673-111682
Main Authors	Lim, Chan-Yeong; Heo, Jungwoo; Kim, Ju-Ho; Shin, Hyun-Seo; Yu, Ha-Jin
Format	Journal Article
Language	English
Published	Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2024
Subjects	Artificial neural networks; Coders; Convolution; Decoding; feature enhancement; Feature extraction; feature refinement; joint training; Modules; Noise measurement; Noise propagation; Noise-aware extended U-Net; noisy environments; Performance degradation; Robustness; speaker verification; Speech enhancement; Speech processing; Speech recognition; split encoder; State-of-the-art reviews; Systems analysis; Training; Verification

More Information
Summary:	Speech data gathered from real-world environments typically contain noise, a significant element that undermines the performance of deep neural network-based speaker verification (SV) systems. To mitigate performance degradation due to noise and develop noise-robust SV systems, several researchers have integrated speech enhancement (SE) and SV systems. We previously proposed the extended U-Net (ExU-Net), which achieved state-of-the-art performance in SV in noisy environments by jointly training SE and SV systems. In the SE field, some studies have shown that recognizing noise components within speech can improve the system's performance. Inspired by these approaches, we propose a noise-aware ExU-Net (NA-ExU-Net) that acknowledges noise information in the SE process based on the ExU-Net architecture. The proposed system comprises a Split Encoder and a feature refinement module (FRM). The Split Encoder handles the speech and noise separately by dividing the encoder blocks, whereas FRM is designed to inhibit the propagation of irrelevant data via skip connections. To validate the effectiveness of our proposed framework in noisy conditions, we evaluated the models on the VoxCeleb1 test set with added noise from the MUSAN corpus. The experimental results demonstrate that NA-ExU-Net outperforms the ExU-Net and other baseline systems under all evaluation conditions. Furthermore, evaluations in out-of-domain noise environments indicate that NA-ExU-Net significantly surpasses existing frameworks, highlighting its robustness and generalization capabilities. The code used in our experiments is available at https://github.com/chan-yeong0519/NA-ExU-Net
ISSN:	2169-3536
DOI:	10.1109/ACCESS.2024.3433465
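
The summary mentions evaluating on test utterances with added noise. As a rough illustration of what adding noise to speech can look like, the sketch below mixes a noise clip into an utterance at a chosen signal-to-noise ratio (SNR). Mixing at a fixed SNR is an assumption made here for illustration; this record does not state the levels or the procedure the authors used, and the names `mean_power` and `mix_at_snr` are hypothetical. The authors' actual pipeline is in the repository linked in the summary.

```python
import math
import random


def mean_power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the mixture has the requested SNR in dB.

    Assumption: SNR-based mixing is used only as an illustration; it is not
    taken from the paper or its code.
    """
    if len(noise) < len(speech):
        # Tile the noise clip so it covers the whole utterance.
        repeats = math.ceil(len(speech) / len(noise))
        noise = list(noise) * repeats
    noise = noise[:len(speech)]

    speech_power = mean_power(speech)
    noise_power = mean_power(noise)
    if noise_power == 0:
        return list(speech)  # silent noise clip: nothing to add

    # Choose gain g so that speech_power / (g^2 * noise_power) = 10^(snr_db / 10).
    gain = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


# Synthetic stand-ins for real audio: a 220 Hz tone as "speech" and
# uniform random samples as "noise" (both hypothetical test signals).
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.uniform(-1.0, 1.0) for _ in range(4000)]
noisy = mix_at_snr(speech, noise, snr_db=5.0)
```

Lower `snr_db` values give a noisier mixture; with real recordings the sample lists would come from decoded audio files rather than synthetic signals.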