Deep Context: End-to-end Contextual Speech Recognition
| Published in | 2018 IEEE Spoken Language Technology Workshop (SLT), pp. 418-425 |
|---|---|
| Main Authors | Golan Pundak, Tara N. Sainath, Rohit Prabhavalkar, Anjuli Kannan, Ding Zhao |
| Format | Conference Proceeding |
| Language | English |
| Published | IEEE, 01.12.2018 |
| Summary | In automatic speech recognition (ASR), what a user says depends on the particular context she is in. Typically, this context is represented as a set of word n-grams. In this work, we present a novel, all-neural, end-to-end (E2E) ASR system that utilizes such context. Our approach, which we refer to as Contextual Listen, Attend and Spell (CLAS), jointly optimizes the ASR components along with embeddings of the context n-grams. During inference, the CLAS system can be presented with context phrases which might contain out-of-vocabulary (OOV) terms not seen during training. We compare our proposed system to a more traditional contextualization approach, which performs shallow fusion between independently trained LAS and contextual n-gram models during beam search. Across a number of tasks, we find that the proposed CLAS system outperforms the baseline method by as much as 68% relative WER, indicating the advantage of joint optimization over individually trained components. |
| DOI | 10.1109/SLT.2018.8639034 |
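
The summary names two contextualization strategies: CLAS, which learns embeddings of the context phrases jointly with the recognizer and attends over them at each decoding step, and the shallow-fusion baseline, which interpolates scores from independently trained models during beam search. The sketch below illustrates both mechanisms in miniature. It is not the paper's implementation: the dimensions, the mean-pooled character embeddings standing in for the paper's trained bias encoder, and the fusion weight `lam` are all assumptions chosen to keep the example self-contained.

```python
# Minimal sketch of the two contextualization ideas in the summary above.
# All shapes and weights are illustrative assumptions, not the paper's values.
import numpy as np

rng = np.random.default_rng(0)

EMBED_DIM = 32   # dimension of bias-phrase embeddings (assumed)
DEC_DIM = 64     # dimension of the decoder hidden state (assumed)

def embed_bias_phrases(phrases, char_table):
    """Embed each context phrase as the mean of its character embeddings.

    CLAS uses a trained bias-encoder network for this step; mean pooling is
    a stand-in so the sketch runs on its own. An all-zeros row is prepended
    as a "no-bias" option, letting the model ignore the context entirely.
    """
    rows = [np.zeros(EMBED_DIM)]                     # index 0 = no-bias option
    for p in phrases:
        rows.append(np.mean([char_table[c] for c in p], axis=0))
    return np.stack(rows)                            # (num_phrases + 1, EMBED_DIM)

def bias_attention(dec_state, bias_embeddings, W):
    """One step of dot-product attention over the bias-phrase embeddings.

    Returns the bias context vector that would be combined with the audio
    attention context before the next decoder step, plus the weights.
    """
    scores = bias_embeddings @ (W @ dec_state)       # (num_phrases + 1,)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                             # softmax over phrases
    return probs @ bias_embeddings, probs

def shallow_fusion_score(log_p_las, log_p_context, lam=0.3):
    """Baseline: interpolate log-scores of independently trained LAS and
    contextual n-gram models during beam search (lam is an assumption)."""
    return log_p_las + lam * log_p_context

# --- toy usage ----------------------------------------------------------
phrases = ["play music", "call mom"]                 # contextual phrases
char_table = {c: rng.normal(size=EMBED_DIM) for c in set("".join(phrases))}

bias_emb = embed_bias_phrases(phrases, char_table)
W = 0.1 * rng.normal(size=(EMBED_DIM, DEC_DIM))      # learned in practice
dec_state = rng.normal(size=DEC_DIM)                 # decoder state at step t

context_vec, weights = bias_attention(dec_state, bias_emb, W)
print("bias-attention weights over [no-bias] + phrases:", np.round(weights, 3))
print("fused hypothesis score:", shallow_fusion_score(-4.2, -1.1))
```

In the full CLAS model the vector returned by `bias_attention` is concatenated with the audio attention context and fed to the decoder, and all parameters, including the bias encoder, are trained jointly; that joint optimization is the property the summary credits for the reported gains of up to 68% relative WER over the shallow-fusion baseline.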