Attention-based Contextual Language Model Adaptation for Speech Recognition

Bibliographic Details
Published in: arXiv.org
Main Authors: Diehl Martinez, Richard; Novotney, Scott; Bulyko, Ivan; Rastrow, Ariya; Stolcke, Andreas; Gandhe, Ankur
Format: Paper
Language: English
Published: Ithaca: Cornell University Library, arXiv.org, 02.06.2021

Summary: Language modeling (LM) for automatic speech recognition (ASR) does not usually incorporate utterance-level contextual information. For some domains like voice assistants, however, additional context, such as the time at which an utterance was spoken, provides a rich input signal. We introduce an attention mechanism for training neural speech recognition language models on both text and non-linguistic contextual data. When applied to a large de-identified dataset of utterances collected by a popular voice assistant platform, our method reduces perplexity by 7.0% relative over a standard LM that does not incorporate contextual information. When evaluated on utterances extracted from the long tail of the dataset, our method improves perplexity by 9.0% relative over a standard LM and by over 2.8% relative when compared to a state-of-the-art model for contextual LM.
ISSN: 2331-8422
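
The sketch below is a rough illustration of the general idea described in the summary: a neural LM attends over embeddings of utterance-level, non-linguistic context (e.g., a bucketed time-of-day signal) and fuses the attended context into its next-word prediction. It is a minimal PyTorch sketch under assumed choices (the ContextualAttentionLM class, layer sizes, a single shared context embedding table, and dot-product attention are all illustrative), not the architecture from the paper.

```python
# Hypothetical sketch of attention-based contextual fusion for a neural LM.
# Not the authors' exact model; it only illustrates letting the LM attend over
# non-linguistic context embeddings (e.g., time of day) alongside the text.

import torch
import torch.nn as nn


class ContextualAttentionLM(nn.Module):
    def __init__(self, vocab_size, num_context_values, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, embed_dim)
        # A single context table (e.g., 24 hour-of-day buckets) keeps the sketch small;
        # a real system could use one table per context signal.
        self.context_embed = nn.Embedding(num_context_values, hidden_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.attn = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.output = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, tokens, context_ids):
        # tokens:      (batch, seq_len) token ids
        # context_ids: (batch, num_signals) ids of non-linguistic context values
        h, _ = self.lstm(self.token_embed(tokens))              # (B, T, H)
        ctx = self.context_embed(context_ids)                   # (B, C, H)

        # Each LM time step queries the set of context embeddings.
        scores = torch.einsum("bth,bch->btc", self.attn(h), ctx)
        weights = torch.softmax(scores, dim=-1)                 # (B, T, C)
        attended = torch.einsum("btc,bch->bth", weights, ctx)   # (B, T, H)

        # Fuse attended context with the LM state before predicting the next token.
        return self.output(torch.cat([h, attended], dim=-1))    # (B, T, V)


if __name__ == "__main__":
    model = ContextualAttentionLM(vocab_size=1000, num_context_values=24)
    tokens = torch.randint(0, 1000, (2, 8))   # dummy batch of token ids
    context = torch.randint(0, 24, (2, 3))    # 3 context values per utterance
    logits = model(tokens, context)
    print(logits.shape)                       # torch.Size([2, 8, 1000])
```

One design point this sketch highlights: because the context is attended rather than simply concatenated, the model can learn to down-weight context signals that are uninformative for a given utterance, which is the kind of behavior a contextual LM needs on head queries versus long-tail queries.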