Deep Context: End-to-end Contextual Speech Recognition

Bibliographic Details
Published in: 2018 IEEE Spoken Language Technology Workshop (SLT), pp. 418 - 425
Main Authors: Pundak, Golan; Sainath, Tara N.; Prabhavalkar, Rohit; Kannan, Anjuli; Zhao, Ding
Format: Conference Proceeding
Language: English
Published: IEEE, 01.12.2018
More Information
Summary: In automatic speech recognition (ASR), what a user says depends on the particular context she is in. Typically, this context is represented as a set of word n-grams. In this work, we present a novel, all-neural, end-to-end (E2E) ASR system that utilizes such context. Our approach, which we refer to as Contextual Listen, Attend and Spell (CLAS), jointly optimizes the ASR components along with embeddings of the context n-grams. During inference, the CLAS system can be presented with context phrases which might contain out-of-vocabulary (OOV) terms not seen during training. We compare our proposed system to a more traditional contextualization approach, which performs shallow fusion between independently trained LAS and contextual n-gram models during beam search. Across a number of tasks, we find that the proposed CLAS system outperforms the baseline method by as much as 68% relative WER, indicating the advantage of joint optimization over individually trained components.
DOI: 10.1109/SLT.2018.8639034
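
The summary contrasts CLAS with a shallow-fusion baseline that interpolates the E2E model's scores with an external contextual n-gram model during beam search. Below is a minimal Python sketch of that score interpolation only, not the paper's implementation; the function names, the backoff value, and the fusion weight of 0.3 are illustrative assumptions.

import math

def shallow_fusion_score(las_log_prob, context_log_prob, fusion_weight=0.3):
    # Interpolate the E2E (LAS) log-probability with the contextual model's
    # log-probability; fusion_weight is a tunable hyperparameter (assumed value).
    return las_log_prob + fusion_weight * context_log_prob

def rescore_beam(beam, context_scores, fusion_weight=0.3):
    # beam: list of (hypothesis, las_log_prob) pairs from the E2E decoder.
    # context_scores: log-probabilities from the contextual n-gram model.
    rescored = []
    for hyp, las_lp in beam:
        # Back off to a small probability for hypotheses the contextual model
        # does not cover (illustrative backoff, not from the paper).
        ctx_lp = context_scores.get(hyp, math.log(1e-6))
        rescored.append((hyp, shallow_fusion_score(las_lp, ctx_lp, fusion_weight)))
    # Return hypotheses ranked by the fused score, best first.
    return sorted(rescored, key=lambda kv: kv[1], reverse=True)

if __name__ == "__main__":
    beam = [("call joan", -1.2), ("call john", -1.0)]
    # The contextual model boosts the phrase present in the user's context.
    context_scores = {"call joan": math.log(0.8)}
    print(rescore_beam(beam, context_scores))

In contrast, CLAS removes this separately trained contextual model: the context phrases are embedded by the network itself and attended to by the decoder, so the whole system is optimized jointly.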