Dual Application of Speech Enhancement for Automatic Speech Recognition

Bibliographic Details
Published in: 2021 IEEE Spoken Language Technology Workshop (SLT), pp. 223-228
Main Authors: Pandey, Ashutosh; Liu, Chunxi; Wang, Yun; Saraf, Yatharth
Format: Conference Proceeding
Language: English
Published: IEEE, 19.01.2021

Summary: In this work, we exploit speech enhancement to improve a recurrent neural network transducer (RNN-T) based ASR system. We employ a dense convolutional recurrent network (DCRN) for complex spectral mapping based speech enhancement, and find it helpful for ASR in two ways: as a data augmentation technique and as a preprocessing frontend. When using it for ASR data augmentation, we exploit a KL divergence based consistency loss computed between the ASR outputs of the original and enhanced utterances. When using speech enhancement as an ASR frontend, we propose a three-step training scheme based on model pretraining and feature selection. We evaluate the proposed techniques on a challenging social media English video dataset and achieve an average relative improvement of 11.2% with speech enhancement based data augmentation, 8.3% with enhancement based preprocessing, and 13.4% when combining both.
DOI: 10.1109/SLT48900.2021.9383624
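The KL divergence based consistency loss mentioned in the summary can be illustrated with a minimal sketch. It assumes the ASR outputs for the original and enhanced versions of an utterance are already aligned as equal-length sequences of per-position logit vectors (a simplification of the RNN-T output lattice), uses KL(original || enhanced) averaged over positions, and adds it to the two ASR losses through a hypothetical weight `consistency_weight`. The record does not specify the divergence direction, the averaging, or the weighting, so these choices and all function names here are illustrative assumptions, not the authors' implementation.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    # Numerically stable log-softmax: subtract the max before exponentiating.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(p_logits: Sequence[float], q_logits: Sequence[float]) -> float:
    """KL(P || Q) for two categorical distributions given as unnormalized logits."""
    log_p = log_softmax(p_logits)
    log_q = log_softmax(q_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def consistency_loss(orig_outputs: Sequence[Sequence[float]],
                     enh_outputs: Sequence[Sequence[float]]) -> float:
    """Hypothetical helper: mean per-position KL between the output distributions
    for the original utterance and for its enhanced version."""
    if not orig_outputs or len(orig_outputs) != len(enh_outputs):
        raise ValueError("outputs must be non-empty and aligned position-by-position")
    total = sum(kl_divergence(p, q) for p, q in zip(orig_outputs, enh_outputs))
    return total / len(orig_outputs)


def total_loss(asr_loss_orig: float, asr_loss_enh: float,
               consistency: float, consistency_weight: float = 1.0) -> float:
    # Assumed combination: ASR losses on both views plus a weighted consistency term.
    return asr_loss_orig + asr_loss_enh + consistency_weight * consistency


if __name__ == "__main__":
    orig = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]   # toy logits, original utterance
    enh = [[1.0, 1.0, 1.0], [0.1, 0.2, 3.0]]     # toy logits, enhanced utterance
    c = consistency_loss(orig, enh)
    print(c, total_loss(1.0, 2.0, c, consistency_weight=0.5))
```

Identical outputs give a consistency loss of zero, and the loss grows as the enhanced utterance pushes the model's predictions away from those on the original audio, which is the behavior a consistency regularizer is meant to penalize. In a real training setup the same computation would run on framework tensors so gradients can flow through it.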