Statistical Transformation of Language and Pronunciation Models for Spontaneous Speech Recognition


Bibliographic Details
Published in: IEEE Transactions on Audio, Speech, and Language Processing, Vol. 18, No. 6, pp. 1539–1549
Main Authors: Akita, Yuya; Kawahara, Tatsuya
Format: Journal Article
Language: English
Published: Piscataway, NJ: IEEE (Institute of Electrical and Electronics Engineers), 01.08.2010

Summary: We propose a novel approach based on a statistical transformation framework for language and pronunciation modeling of spontaneous speech. Since it is not practical to train a spoken-style model on large amounts of spoken transcripts, the proposed approach generates a spoken-style model by transforming an orthographic model trained on document archives such as the minutes of meetings and the proceedings of lectures. The transformation is based on a statistical model estimated from a small parallel corpus, which consists of faithful transcripts aligned with their orthographic documents. Patterns of transformation, such as substitution, deletion, and insertion of words, are extracted together with their word and part-of-speech (POS) contexts, and transformation probabilities are estimated from occurrence statistics in the parallel aligned corpus. For pronunciation modeling, subword-based mappings between baseforms and surface forms are extracted with their occurrence counts, and a set of rewrite rules with their probabilities is then derived as a transformation model. Spoken-style language and pronunciation (surface-form) models can be predicted by applying these transformation patterns to a document-style language model and to the baseforms in a lexicon, respectively. The transformed models significantly reduced perplexity and word error rates (WERs) in a task of transcribing congressional meetings, even though the domains and topics differed from those of the parallel corpus. This result demonstrates the generality and portability of the proposed framework.
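The core idea described in the summary, estimating transformation probabilities (substitution, deletion, insertion) from occurrence statistics in a parallel aligned corpus, can be illustrated with a minimal Python sketch. This is not the authors' implementation; the toy aligned pairs, the `<eps>` marker for deletions/insertions, and the relative-frequency estimator are all illustrative assumptions, and the paper's POS-context conditioning is omitted for brevity.

```python
from collections import Counter

# Toy parallel corpus of aligned (document-style, spoken-style) word pairs.
# "<eps>" on the left marks an insertion in speech (e.g., a filler);
# "<eps>" on the right marks a deletion of a document-style word.
aligned_pairs = [
    ("is", "is"), ("is", "'s"), ("is", "is"),
    ("going_to", "gonna"), ("going_to", "going_to"),
    ("<eps>", "uh"),        # filler inserted in spoken style
    ("that", "<eps>"),      # word dropped in spoken style
]

# Count each transformation pattern and each source-side word.
pattern_counts = Counter(aligned_pairs)
source_counts = Counter(src for src, _ in aligned_pairs)

# Transformation probability P(spoken | orthographic) by relative frequency.
transform_prob = {
    (src, tgt): count / source_counts[src]
    for (src, tgt), count in pattern_counts.items()
}

# Derive one rewrite rule per source word: keep its most probable rewrite.
best_rule = {}
for (src, tgt), p in sorted(transform_prob.items(), key=lambda kv: -kv[1]):
    best_rule.setdefault(src, (tgt, p))
```

In the paper these probabilities are conditioned on word and POS contexts and applied to a document-style n-gram model (and, via subword mappings, to lexicon baseforms); the sketch only shows the relative-frequency estimation step on unigram patterns.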
ISSN: 1558-7916, 1558-7924
DOI: 10.1109/TASL.2009.2037400