Hybrid end-to-end model for Kazakh speech recognition

Modern automatic speech recognition systems based on end-to-end (E2E) models show good results in terms of the accuracy of language recognition, which have large corpuses for several thousand hours of speech for system training. Such models require a very large amount of training data, which is prob...

Full description

Saved in:
Bibliographic Details
Published inInternational journal of speech technology Vol. 26; no. 2; pp. 261 - 270
Main Authors Mamyrbayev, Orken Zh, Oralbekova, Dina O., Alimhan, Keylan, Nuranbayeva, Bulbul M.
Format Journal Article
LanguageEnglish
Published New York Springer US 01.07.2023
Springer Nature B.V
Subjects
Online AccessGet full text

Cover

Loading…
More Information
Summary:Modern automatic speech recognition systems based on end-to-end (E2E) models show good results in terms of the accuracy of language recognition, which have large corpuses for several thousand hours of speech for system training. Such models require a very large amount of training data, which is problematic for low-resource languages like the Kazakh language. However, many studies have shown that the combination of connectionist temporal classification with other E2E models improves the performance of systems even with limited training data. In this regard, the speech corpus of the Kazakh language was assembled, and this corpus was expanded using the augmentation method. Our work presents the implementation of a joint model of CTC and the attention mechanism for recognition of Kazakh speech, which solves the problem of rapid decoding and training of the system. The results demonstrated that the proposed E2E model using language models improved the system performance and showed the best result on our dataset for the Kazakh language. As a result of the experiment, the system achieved competitive results in Kazakh speech recognition.
ISSN:1381-2416
1572-8110
DOI:10.1007/s10772-022-09983-8