A Hybrid Acoustic and Pronunciation Model Adaptation Approach for Non-native Speech Recognition

Bibliographic Details
Published in	IEICE Transactions on Information and Systems, Vol. E93.D, no. 9, pp. 2379-2387
Main Authors	OH, Yoo Rhee, KIM, Hong Kook
Format	Journal Article
Language	English
Published	The Institute of Electronics, Information and Communication Engineers 2010
Subjects	acoustic model adaptation; Acoustics; Adaptation; Classification; Clustering; Dictionaries; Mathematical models; non-native speech recognition; pronunciation model adaptation; pronunciation variability; Speech; Speech recognition; state-tying level hybrid adaptation; triphone-modeling level hybrid adaptation

Summary:	In this paper, we propose a hybrid model adaptation approach that adapts both the pronunciation and acoustic models to the pronunciation and acoustic variability of non-native speech, with the goal of improving non-native automatic speech recognition (ASR). The proposed hybrid adaptation can be performed at either the state-tying level or the triphone-modeling level, depending on the level at which acoustic model adaptation is carried out. In both methods, we first analyze the pronunciation variant rules of non-native speakers and classify each rule as either a pronunciation variant or an acoustic variant. The state-tying level hybrid method then adapts the pronunciation model by adding the pronunciation variants to the pronunciation dictionary, and adapts the acoustic model by clustering the states of the triphone acoustic models using the acoustic variants. The triphone-modeling level hybrid method adapts the pronunciation model in the same way; for the acoustic model, however, it re-estimates the triphone acoustic models based on the adapted pronunciation model and then clusters the states of the re-estimated triphone acoustic models using the acoustic variants. In Korean-spoken English speech recognition experiments, ASR systems employing the state-tying and triphone-modeling level adaptation methods reduce the average word error rate (WER) for non-native speech by a relative 17.1% and 22.1%, respectively, compared with a baseline ASR system.
ISSN:	0916-8532 1745-1361
DOI:	10.1587/transinf.E93.D.2379
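
The summary reports relative, not absolute, WER reductions (17.1% and 22.1%). The sketch below shows how a relative reduction relates a baseline WER to an adapted WER. The record does not give the baseline WER, so the value `HYPOTHETICAL_BASELINE_WER` is an assumption made purely for illustration, and the function names are likewise illustrative rather than taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, adapted_wer: float) -> float:
    """Return the relative WER reduction, in percent, of adapted vs. baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - adapted_wer) / baseline_wer


def adapted_wer_from_reduction(baseline_wer: float, reduction_pct: float) -> float:
    """Return the adapted WER implied by a relative reduction given in percent."""
    return baseline_wer * (1.0 - reduction_pct / 100.0)


# ASSUMPTION: this baseline is hypothetical and NOT reported in the record.
HYPOTHETICAL_BASELINE_WER = 20.0

# Relative reductions quoted in the summary.
REPORTED_REDUCTIONS = {
    "state-tying level": 17.1,
    "triphone-modeling level": 22.1,
}

if __name__ == "__main__":
    for method, reduction in REPORTED_REDUCTIONS.items():
        wer = adapted_wer_from_reduction(HYPOTHETICAL_BASELINE_WER, reduction)
        print(f"{method}: {reduction}% relative reduction -> {wer:.2f}% WER "
              f"(from a hypothetical {HYPOTHETICAL_BASELINE_WER}% baseline)")
```

A relative reduction scales with the baseline: the same 17.1% figure corresponds to a larger absolute gain when the baseline WER is high than when it is low, which is why the absolute WERs cannot be recovered from this record alone.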