Towards Fast and Accurate Streaming End-To-End ASR

Bibliographic Details
Published in	ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) pp. 6069 - 6073
Main Authors	Li, Bo; Chang, Shuo-yiin; Sainath, Tara N.; Pang, Ruoming; He, Yanzhang; Strohman, Trevor; Wu, Yonghui
Format	Conference Proceeding
Language	English
Published	IEEE 01.05.2020
Subjects	Acoustics; Decoding; Endpointer; Error analysis; Latency; Recurrent neural networks; RNN-T; Speech recognition; Training; Transducers

More Information
Summary:	End-to-end (E2E) models fold the acoustic, pronunciation and language models of a conventional speech recognition system into a single neural network with far fewer parameters, making them suitable for on-device applications. For example, the recurrent neural network transducer (RNN-T), a streaming E2E model, has shown promising potential for on-device ASR [1]. For such applications, quality and latency are two critical factors. We propose to reduce the E2E model's latency by extending the RNN-T endpointer (RNN-T EP) model [2] with additional early and late penalties. By further applying the minimum word error rate (MWER) training technique [3], we achieved an 8.0% relative word error rate (WER) reduction and a 130 ms reduction in 90th-percentile latency over [2] on a Voice Search test set. We also experimented with a second-pass Listen, Attend and Spell (LAS) rescorer [4]. Although it did not directly improve the first-pass latency, the large WER reduction provides extra room to trade WER for latency. RNN-T EP+LAS, together with MWER training, brings an 18.7% relative WER reduction and a 160 ms reduction in 90th-percentile latency compared to the originally proposed RNN-T EP model [2].
ISSN:	2379-190X
DOI:	10.1109/ICASSP40776.2020.9054715
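
The summary reports its results as relative WER reductions and reductions in 90th-percentile latency. As a minimal sketch of how such figures are typically computed, the Python below defines two helpers. The function names, the nearest-rank percentile definition, and all numbers are illustrative assumptions; they are not taken from the paper, which does not state its absolute WERs or its percentile method in this record.

```python
import math


def relative_reduction(baseline: float, improved: float) -> float:
    """Return the relative reduction of `improved` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - improved) / baseline


def percentile_nearest_rank(values, pct):
    """Return the pct-th percentile of `values` using the nearest-rank method.

    Assumption: nearest-rank is used here for simplicity; other percentile
    definitions (e.g. linear interpolation) can give slightly different values.
    """
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical WERs (percent) for a baseline and an improved model.
wer_gain = relative_reduction(baseline=8.0, improved=7.0)  # 12.5

# Hypothetical per-utterance endpointer latencies in milliseconds.
baseline_ms = [300, 350, 400, 420, 450, 500, 520, 600, 700, 900]
improved_ms = [280, 330, 370, 400, 430, 470, 500, 560, 600, 850]

p90_baseline = percentile_nearest_rank(baseline_ms, 90)  # 700
p90_improved = percentile_nearest_rank(improved_ms, 90)  # 600
latency_gain_ms = p90_baseline - p90_improved            # 100
```

Note that the WER figure is relative (a percentage of the baseline WER) while the latency figure is an absolute difference in milliseconds between the two 90th-percentile values, which is how the summary phrases its two kinds of result.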