Hybrid end-to-end model for Kazakh speech recognition
Published in | International Journal of Speech Technology Vol. 26; no. 2; pp. 261-270 |
---|---|
Main Authors | , , , |
Format | Journal Article |
Language | English |
Published | New York; Springer US; 01.07.2023; Springer Nature B.V. |
Subjects | |
Summary: | Modern automatic speech recognition systems based on end-to-end (E2E) models achieve high recognition accuracy for languages that have large corpora of several thousand hours of speech available for training. Such models require a very large amount of training data, which is problematic for low-resource languages such as Kazakh. However, many studies have shown that combining connectionist temporal classification (CTC) with other E2E models improves system performance even with limited training data. To this end, a speech corpus of the Kazakh language was assembled and then expanded using data augmentation. Our work presents the implementation of a joint model of CTC and the attention mechanism for Kazakh speech recognition, which addresses the problem of rapid decoding and training of the system. The results demonstrated that the proposed E2E model, when used with language models, improved system performance and showed the best result on our dataset for the Kazakh language. In the experiment, the system achieved competitive results in Kazakh speech recognition. |
---|---|
ISSN: | 1381-2416 1572-8110 |
DOI: | 10.1007/s10772-022-09983-8 |
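
The summary describes a joint CTC and attention model used together with language models, but this record does not give the formulation. As a rough illustration only, the sketch below assumes the interpolation commonly used for hybrid CTC/attention systems: a weighted sum of the two training losses, and a weighted sum of log-probabilities when ranking decoding hypotheses. The function names, the dictionary keys, and the default weight of 0.3 are hypothetical and are not taken from the article.

```python
def joint_loss(ctc_loss, att_loss, ctc_weight=0.3):
    """Interpolate two losses: w * L_ctc + (1 - w) * L_att.

    ctc_weight is an assumed hyperparameter, not a value from the article.
    """
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * att_loss


def pick_hypothesis(hypotheses, ctc_weight=0.3, lm_weight=0.0):
    """Return the text of the hypothesis with the highest combined log-score.

    Each hypothesis is a dict with the (hypothetical) keys 'text',
    'ctc_logp', 'att_logp', and optionally 'lm_logp'.
    """
    def score(h):
        return (
            ctc_weight * h["ctc_logp"]
            + (1.0 - ctc_weight) * h["att_logp"]
            + lm_weight * h.get("lm_logp", 0.0)
        )

    return max(hypotheses, key=score)["text"]


# Two made-up candidates: the CTC branch prefers "a", the attention branch "b".
candidates = [
    {"text": "a", "ctc_logp": -1.0, "att_logp": -4.0},
    {"text": "b", "ctc_logp": -3.0, "att_logp": -2.0},
]
best = pick_hypothesis(candidates, ctc_weight=0.3)
```

With a small CTC weight the attention score dominates the ranking; raising the weight toward 1 shifts the choice toward the CTC branch. The `lm_weight` term stands in for the language-model contribution mentioned in the summary and is likewise an assumption.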