DCTX-Conformer: Dynamic context carry-over for low latency unified streaming and non-streaming Conformer ASR
Main Authors | , , , , , |
---|---|
Format | Journal Article |
Language | English |
Published | 13.06.2023 |
Subjects | |
Summary: | Conformer-based end-to-end models have become ubiquitous these days and are
commonly used in both streaming and non-streaming automatic speech recognition
(ASR). Techniques like dual-mode and dynamic chunk training helped unify
streaming and non-streaming systems. However, there remains a performance gap
between streaming with a full and limited past context. To address this issue,
we propose the integration of a novel dynamic contextual carry-over mechanism
in a state-of-the-art (SOTA) unified ASR system. Our proposed dynamic context
Conformer (DCTX-Conformer) utilizes a non-overlapping contextual carry-over
mechanism that takes into account both the left context of a chunk and one or
more preceding context embeddings. We improve on the SOTA by a relative 25.0%
in word error rate, with negligible latency impact from the additional context
embeddings. |
---|---|
DOI: | 10.48550/arxiv.2306.08175 |
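
The summary describes each chunk as seeing its own frames, a limited left context, and one or more preceding context embeddings. A minimal sketch of that bookkeeping follows. It is not the authors' implementation: the names `ChunkView` and `plan_chunks`, the parameters `left_frames` and `num_carry`, and the assumption that every chunk yields exactly one context embedding are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ChunkView:
    """What one chunk is allowed to attend to (hypothetical structure)."""
    chunk: Tuple[int, int]         # [start, end) frames of the current chunk
    left_context: Tuple[int, int]  # [start, end) past frames visible directly
    carried_chunks: List[int]      # earlier chunks whose context embedding is visible


def plan_chunks(num_frames: int, chunk_size: int,
                left_frames: int, num_carry: int) -> List[ChunkView]:
    """Split an utterance into non-overlapping chunks and list each chunk's context.

    Assumption (not from the source): each chunk produces one context
    embedding, and a chunk sees the embeddings of its last `num_carry`
    predecessors in addition to `left_frames` raw past frames.
    """
    views = []
    for idx, start in enumerate(range(0, num_frames, chunk_size)):
        end = min(start + chunk_size, num_frames)   # last chunk may be shorter
        left_start = max(0, start - left_frames)    # clip at utterance start
        first_carried = max(0, idx - num_carry)     # oldest embedding still kept
        views.append(ChunkView(
            chunk=(start, end),
            left_context=(left_start, start),
            carried_chunks=list(range(first_carried, idx)),
        ))
    return views


if __name__ == "__main__":
    for view in plan_chunks(num_frames=10, chunk_size=4, left_frames=2, num_carry=2):
        print(view)
```

In this sketch, raising `num_carry` extends how far back a chunk can see by adding one entry per earlier chunk rather than widening the window of raw frames, which is one plausible reading of why the summary reports only a negligible latency impact.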