Contextualized End-to-End Speech Recognition with Contextual Phrase Prediction Network


Bibliographic Details
Published in arXiv.org
Main Authors Huang, Kaixun; Zhang, Ao; Yang, Zhanheng; Guo, Pengcheng; Mu, Bingshen; Xu, Tianyi; Xie, Lei
Format Paper
Language English
Published Ithaca: Cornell University Library, arXiv.org, 12.07.2023
Summary: Contextual information plays a crucial role in speech recognition, and incorporating it into end-to-end speech recognition models has drawn considerable interest recently. However, previous deep bias methods lacked explicit supervision for the bias task. In this study, we introduce a contextual phrase prediction network for an attention-based deep bias method. This network uses contextual embeddings to predict the context phrases that occur in an utterance, and it computes a bias loss that assists the training of the contextualized model. Our method achieves a significant word error rate (WER) reduction across various end-to-end speech recognition models. Experiments on the LibriSpeech corpus show that the proposed model obtains a 12.1% relative WER improvement over the baseline model, and that the WER on context phrases decreases by 40.5% relative. Moreover, by applying a context phrase filtering strategy, we effectively eliminate the WER degradation that occurs when a larger biasing list is used.
ISSN: 2331-8422
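
The summary reports its gains as relative WER reductions (12.1% overall, 40.5% on context phrases). As a reminder of how such a figure is usually derived, here is a minimal sketch of the standard definition, (baseline - new) / baseline. The function name and the WER values in the usage lines are hypothetical illustrations; they are not taken from the paper, which does not give absolute WERs in this record.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction, as a percentage of the baseline WER.

    Both arguments are absolute WERs expressed in the same unit (e.g. percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical absolute WERs, chosen only to illustrate the arithmetic;
# they are NOT results from the paper.
example_baseline_wer = 10.0
example_new_wer = 8.0
reduction = relative_wer_reduction(example_baseline_wer, example_new_wer)
print(f"relative WER reduction: {reduction:.1f}%")
```

A relative figure depends on the baseline, so the same relative reduction can correspond to very different absolute WER changes.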