Research on End-to-end Tibetan Speech Recognition Acoustic Model Based on Multi-scale Features

Bibliographic Details
Published in	2023 IEEE 4th International Conference on Pattern Recognition and Machine Learning (PRML), pp. 458-462
Main Authors	Wang, Jiawen; Gao, Dingguo; Suolang, Quzhen
Format	Conference Proceeding
Language	English
Published	IEEE 04.08.2023
Subjects	Acoustic model; Acoustics; End-to-end model; Error analysis; Feature extraction; Machine learning; Multiscale features; Phonetics; Predictive models; Speech recognition
Summary:	Tibetan is one of the important languages of China's ethnic minorities and has rich cultural and historical value. Tibetan speech recognition is nevertheless a challenging task because of the complexity of the language's phonetic features and the scarcity of data. Although some research results have been achieved, there is still considerable room for improvement. In this paper, we propose an end-to-end Tibetan speech recognition acoustic model based on multi-scale features. It addresses a weakness of the non-encoder-decoder models widely used as acoustic models in Tibetan speech recognition experiments: they perform poorly on recognition tasks that depend on information from the predicted sequence. We compare a baseline model built on the attention-based encoder-decoder speech recognition framework with four Tibetan speech recognition acoustic models, and then improve the baseline by using a hybrid loss function and multi-scale features for feature extraction. The experimental results show that the attention-based encoder-decoder model is feasible for Tibetan speech recognition, and that the hybrid loss function and multi-scale features improve the model's recognition performance. The proposed model currently gives the best recognition results for the Lhasa dialect of Tibetan, with a test-set word error rate of only 15.04%.
DOI:	10.1109/PRML59573.2023.10348322
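
Note on the reported metric: the summary gives its result as a word error rate (WER). The following is a minimal sketch of how WER is conventionally computed, as the token-level edit distance between the reference and the recognized transcript divided by the number of reference tokens. It is an illustration only, not the authors' scoring code. The record does not say how the Tibetan transcripts were tokenized, so the function names and the placeholder tokens below are assumptions made for the example.

```python
from typing import Sequence


def edit_distance(reference: Sequence[str], hypothesis: Sequence[str]) -> int:
    """Levenshtein distance in tokens (substitutions, deletions, insertions)."""
    # previous[j] holds the distance between the reference prefix
    # processed so far and the first j hypothesis tokens.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_token in enumerate(reference, start=1):
        current = [i]  # turning i reference tokens into nothing costs i deletions
        for j, hyp_token in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_token != hyp_token)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def word_error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """WER = edit distance / number of reference tokens."""
    if not reference:
        raise ValueError("reference must contain at least one token")
    return edit_distance(reference, hypothesis) / len(reference)


# Placeholder tokens (hypothetical, not taken from any Tibetan corpus):
# one substitution (w2 -> wX) and one deletion (w3) against four reference tokens.
reference_tokens = ["w1", "w2", "w3", "w4"]
hypothesis_tokens = ["w1", "wX", "w4"]
example_wer = word_error_rate(reference_tokens, hypothesis_tokens)
```

With these placeholder tokens the edit distance is 2 over 4 reference tokens, so `example_wer` is 0.5. Under this definition, the 15.04% figure in the summary corresponds to roughly 15 token errors per 100 reference tokens on the test set.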