Dual Application of Speech Enhancement for Automatic Speech Recognition

Bibliographic Details
Published in	2021 IEEE Spoken Language Technology Workshop (SLT), pp. 223-228
Main Authors	Pandey, Ashutosh; Liu, Chunxi; Wang, Yun; Saraf, Yatharth
Format	Conference Proceeding
Language	English
Published	IEEE 19.01.2021
Subjects	complex spectral mapping; Conferences; consistency loss; Feature extraction; Neural networks; recurrent neural network transducer; Social networking (online); Speech enhancement; speech recognition; Training; Transducers
Summary:	In this work, we exploit speech enhancement to improve a recurrent neural network transducer (RNN-T) based ASR system. We employ a dense convolutional recurrent network (DCRN) for speech enhancement based on complex spectral mapping, and find it helpful for ASR in two ways: as a data augmentation technique and as a preprocessing frontend. For data augmentation, we exploit a KL-divergence-based consistency loss computed between the ASR outputs of the original and enhanced utterances. To use speech enhancement as an effective ASR frontend, we propose a three-step training scheme based on model pretraining and feature selection. We evaluate the proposed techniques on a challenging social media English video dataset, and achieve an average relative improvement of 11.2% with enhancement-based data augmentation, 8.3% with enhancement-based preprocessing, and 13.4% when combining both.
DOI:	10.1109/SLT48900.2021.9383624
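
The consistency loss mentioned in the summary can be illustrated with a minimal sketch. The summary does not specify the direction of the KL divergence, where in the RNN-T it is computed, or how it is weighted against the ASR loss, so everything below is an assumption: the function names `log_softmax` and `kl_consistency_loss` are hypothetical, the inputs are toy per-frame logits over a small token vocabulary, and the loss is taken as the mean per-frame KL(p_orig || p_enh).

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one frame of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_consistency_loss(logits_orig, logits_enh):
    """Mean per-frame KL(p_orig || p_enh) between two output sequences.

    Each argument is a list of frames; each frame is a list of logits over
    the same token vocabulary. The KL direction and the per-frame averaging
    are assumptions, not details taken from the paper.
    """
    if not logits_orig or len(logits_orig) != len(logits_enh):
        raise ValueError("sequences must be non-empty and of equal length")
    total = 0.0
    for frame_o, frame_e in zip(logits_orig, logits_enh):
        log_p = log_softmax(frame_o)  # outputs for the original utterance
        log_q = log_softmax(frame_e)  # outputs for the enhanced utterance
        total += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return total / len(logits_orig)


# Toy example: 2 frames, 3 tokens (values are made up for illustration).
orig = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
enh = [[1.8, 0.7, -0.9], [0.1, 0.2, 0.3]]
loss = kl_consistency_loss(orig, enh)
```

In a training setup of this kind, such a term would typically be added to the main ASR loss with a tunable weight, so that the model is encouraged to produce similar output distributions for an utterance and its enhanced version.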