Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss
Main Authors | , , , , , , |
---|---|
Format | Journal Article |
Language | English |
Published | 06.02.2020 |
Subjects | |
Summary | In this paper we present an end-to-end speech recognition model with Transformer encoders that can be used in a streaming speech recognition system. Transformer computation blocks based on self-attention are used to encode both audio and label sequences independently. The activations from both audio and label encoders are combined with a feed-forward layer to compute a probability distribution over the label space for every combination of acoustic frame position and label history. This is similar to the Recurrent Neural Network Transducer (RNN-T) model, which uses RNNs for information encoding instead of Transformer encoders. The model is trained with the RNN-T loss, which is well suited to streaming decoding. We present results on the LibriSpeech dataset showing that limiting the left context for self-attention in the Transformer layers makes decoding computationally tractable for streaming, with only a slight degradation in accuracy. We also show that the full-attention version of our model beats the state-of-the-art accuracy on the LibriSpeech benchmarks. Our results also show that we can bridge the gap between the full-attention and limited-attention versions of our model by attending to a limited number of future frames. |
DOI | 10.48550/arxiv.2002.02562 |
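The summary describes a transducer-style joint computation: audio and label encoder activations are combined with a feed-forward layer to give a distribution over the output labels for every (acoustic frame, label history) pair. The sketch below illustrates that idea only; the function and variable names (`joint_network`, `w_audio`, the tanh combination, the toy shapes) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a transducer joint network combining audio and label
# encodings, as described in the summary. Names and shapes are assumptions.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_network(audio_enc, label_enc, w_audio, w_label, w_out, b_out):
    """Combine audio and label encodings with a feed-forward layer.

    audio_enc: (T, D_a)  -- one vector per acoustic frame position
    label_enc: (U, D_l)  -- one vector per label-history position
    Returns:   (T, U, V) -- a distribution over the V output labels
                            (including blank) for every (frame, history) pair.
    """
    a = audio_enc @ w_audio                          # (T, H)
    l = label_enc @ w_label                          # (U, H)
    # Broadcast over the frame (T) and label-history (U) axes and combine.
    hidden = np.tanh(a[:, None, :] + l[None, :, :])  # (T, U, H)
    logits = hidden @ w_out + b_out                  # (T, U, V)
    return softmax(logits)

# Toy example: 5 frames, 3 label-history positions, hidden size 8, vocab 4.
rng = np.random.default_rng(0)
T, U, D_a, D_l, H, V = 5, 3, 16, 16, 8, 4
probs = joint_network(
    rng.normal(size=(T, D_a)), rng.normal(size=(U, D_l)),
    rng.normal(size=(D_a, H)), rng.normal(size=(D_l, H)),
    rng.normal(size=(H, V)), np.zeros(V),
)
assert probs.shape == (T, U, V) and np.allclose(probs.sum(-1), 1.0)
```

The resulting (T, U, V) grid of label probabilities is exactly what the RNN-T loss marginalizes over during training, which is why the same joint computation supports streaming decoding frame by frame.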
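The summary also notes that streaming tractability comes from limiting the left context of self-attention, and that attending to a small number of future frames narrows the gap to full attention. A common way to realize this is a banded attention mask; the sketch below is one such mask under assumed window parameters, not the paper's specific configuration.

```python
# Sketch of a limited-context self-attention mask for a streamable encoder.
# The function name and window sizes are assumptions for illustration.
import numpy as np

def limited_context_mask(num_frames, left_context, right_context):
    """Boolean mask of shape (num_frames, num_frames).

    Frame i may attend to frame j only if
    i - left_context <= j <= i + right_context.
    right_context = 0 gives a purely causal (streaming) encoder; a small
    positive value trades latency for accuracy; setting both windows to
    num_frames recovers full attention.
    """
    idx = np.arange(num_frames)
    rel = idx[None, :] - idx[:, None]          # j - i
    return (rel >= -left_context) & (rel <= right_context)

mask = limited_context_mask(num_frames=6, left_context=2, right_context=1)
print(mask.astype(int))  # row i marks the frames j that frame i may attend to
```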