Hybrid end-to-end model for Kazakh speech recognition

Bibliographic Details
Published in	International Journal of Speech Technology, Vol. 26, no. 2, pp. 261-270
Main Authors	Mamyrbayev, Orken Zh.; Oralbekova, Dina O.; Alimhan, Keylan; Nuranbayeva, Bulbul M.
Format	Journal Article
Language	English
Published	New York: Springer US, 01.07.2023; Springer Nature B.V.
Subjects	Artificial Intelligence; Automatic speech recognition; Corpus linguistics; Engineering; Kazakh language; Language; Language modeling; Performance enhancement; Signal, Image and Speech Processing; Social Sciences; Speech; Speech recognition; Training; Voice recognition; Connectionist temporal classification; End-to-end; Attention; Low resource language
Summary:	Modern automatic speech recognition systems based on end-to-end (E2E) models achieve good recognition accuracy for languages that have large corpora, comprising several thousand hours of speech, for system training. Such models require a very large amount of training data, which is problematic for low-resource languages such as Kazakh. However, many studies have shown that combining connectionist temporal classification (CTC) with other E2E models improves system performance even with limited training data. To this end, a speech corpus of the Kazakh language was assembled and then expanded using augmentation. Our work presents the implementation of a joint model of CTC and the attention mechanism for Kazakh speech recognition, which addresses the problem of rapid decoding and training of the system. The results demonstrated that the proposed E2E model, when used with language models, improved system performance and showed the best result on our dataset for the Kazakh language. In the experiment, the system achieved competitive results in Kazakh speech recognition.
ISSN:	1381-2416; 1572-8110
DOI:	10.1007/s10772-022-09983-8