Benchmarking LF-MMI, CTC and RNN-T Criteria for Streaming ASR

Bibliographic Details
Published in arXiv.org
Main Authors Zhang, Xiaohui, Zhang, Frank, Liu, Chunxi, Schubert, Kjell, Chan, Julian, Prakash, Pradyot, Liu, Jun, Yeh, Ching-Feng, Peng, Fuchun, Saraf, Yatharth, Zweig, Geoffrey
Format Paper
Language English
Published Ithaca Cornell University Library, arXiv.org 09.11.2020
Summary: In this work, to measure the accuracy and efficiency for a latency-controlled streaming automatic speech recognition (ASR) application, we perform comprehensive evaluations on three popular training criteria: LF-MMI, CTC and RNN-T. In transcribing social media videos in 7 languages, with 3K-14K hours of training data, we conduct large-scale controlled experiments across the three criteria using identical datasets and encoder model architecture. We find that RNN-T has consistent wins in ASR accuracy, while CTC models excel at inference efficiency. Moreover, we selectively examine various modeling strategies for different training criteria, including modeling units, encoder architectures, and pre-training. Given such a large-scale real-world streaming ASR application, to the best of our knowledge, we present the first comprehensive benchmark of these three widely used training criteria across many languages.
ISSN: 2331-8422