An Overview of End-to-End Automatic Speech Recognition

Bibliographic Details
Published in	Symmetry (Basel) Vol. 11; no. 8; p. 1018
Main Authors	Wang, Dong; Wang, Xiaodong; Lv, Shaohe
Format	Journal Article
Language	English
Published	Basel: MDPI AG, 2019
Subjects	Acoustics; Alignment; Alliances; Artificial neural networks; Automatic speech recognition; Continuous speech; Deep learning; Hypotheses; Laboratories; Machine learning; Markov analysis; Markov chains; Neural networks; Pattern recognition; Recurrent neural networks; Segmentation; Speech; Speech recognition; Training; Voice recognition

More Information
Summary:	Automatic speech recognition, especially large-vocabulary continuous speech recognition, is an important problem in machine learning. For a long time, the hidden Markov model (HMM)-Gaussian mixture model (GMM) was the mainstream speech recognition framework. Recently, however, the HMM-deep neural network (DNN) model and the end-to-end model, both of which use deep learning, have achieved performance beyond HMM-GMM, and the two have comparable performance. The HMM-DNN model is nevertheless limited by factors inherited from HMM, such as forced segmentation and alignment of the data, independence assumptions, and separate training of multiple modules, whereas the end-to-end model offers a simplified architecture, joint training, direct output, and no need for forced data alignment. The end-to-end model is therefore an important research direction in speech recognition. This paper reviews the development of end-to-end models. It first introduces the basic ideas, advantages, and disadvantages of HMM-based and end-to-end models, and argues that the end-to-end model is the direction in which speech recognition is developing. It then focuses on the principles, progress, and research hotspots of three end-to-end models: connectionist temporal classification (CTC)-based, recurrent neural network (RNN)-transducer, and attention-based, comparing them in detail both theoretically and experimentally. Finally, it points out their respective advantages and disadvantages and the possible future development of the end-to-end model. Automatic speech recognition is a pattern recognition task in the field of computer science, which is a subject area of Symmetry.
ISSN:	2073-8994
DOI:	10.3390/sym11081018