Domain corpus independent vocabulary generation for embedded continuous speech recognition

Bibliographic Details
Published in	IEEE Transactions on Consumer Electronics, Vol. 55, no. 3, pp. 1631-1636
Main Authors	Lim, Minkyu; Kim, Kwang-Ho; Kim, Ji-Hwan
Format	Journal Article
Language	English
Published	New York: IEEE (The Institute of Electrical and Electronics Engineers, Inc.), 01.08.2009
Subjects	Computer science; Context modeling; Coverage; Dictionaries; Domain corpus independent; Electronics; Embedded speech recognition; Frequency; Inclusions; Knowledge bases (artificial intelligence); Natural languages; Searching; Space exploration; Space technology; Speech recognition; Statistical analysis; Statistics; Studies; Tags; Texts; Vocabulary
Summary:	This paper proposes a domain corpus independent vocabulary generation algorithm to improve vocabulary coverage for embedded continuous speech recognition (CSR). A CSR vocabulary is normally derived from a word frequency list, so its coverage depends on the domain corpus used to build that list. We present an improved method of vocabulary generation that uses a part-of-speech (POS) tagged corpus and a knowledge base. We investigate the 152 POS tags defined in a POS tagged corpus, together with its word-POS tag pairs. We analyze all words paired with 101 of the 152 POS tags and decide on a set of words that must be included in vocabularies of any size. The other 51 POS tags fall mainly into noun-related, named entity (NE)-related, and verb-related categories. For these, we introduce a domain corpus independent word inclusion method based on a knowledge base. For noun-related POS tags, we generate synonym groups and analyze their relative importance using Google search. We categorize verbs by lemma and analyze the relative importance of each lemma from pre-analyzed verb statistics. We determine the inclusion order of NEs through Google search. The proposed method shows at least a 28.6% relative improvement in coverage on an SMS text corpus for vocabulary sizes of 5 K, 10 K, 15 K and 20 K. In particular, the coverage of the 15 K vocabulary generated by the proposed method reaches 97.8%, a relative improvement of 44.2%.
ISSN:	0098-3063, 1558-4127
DOI:	10.1109/TCE.2009.5278036
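
The Summary says that a CSR vocabulary derived from a word frequency list has coverage that depends on the domain corpus. The sketch below illustrates only that baseline behaviour, not the proposed POS-tag and knowledge-base method. It assumes coverage means the fraction of running words in a text that appear in the vocabulary, which this record does not define. The function names and the toy corpora are hypothetical.

```python
from collections import Counter


def build_vocabulary(tokens, size):
    """Return the `size` most frequent words in `tokens` as a set.

    This is the frequency-list baseline, not the paper's algorithm.
    """
    counts = Counter(tokens)
    return {word for word, _ in counts.most_common(size)}


def coverage(vocabulary, tokens):
    """Fraction of running words in `tokens` that are in `vocabulary`.

    Assumption: token-level coverage; the record does not give the
    exact definition used by the authors.
    """
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in vocabulary)
    return hits / len(tokens)


# Toy corpora (invented for illustration only).
news = "the market rose as the bank cut rates and the market rallied".split()
sms = "see you at the cafe at noon and bring the tickets".split()

# Vocabulary built from the "news" domain only.
vocab = build_vocabulary(news, size=5)

in_domain = coverage(vocab, news)   # measured on the corpus that built it
out_domain = coverage(vocab, sms)   # measured on a different domain

print(f"in-domain coverage:     {in_domain:.3f}")
print(f"out-of-domain coverage: {out_domain:.3f}")
```

On these toy inputs the vocabulary covers its own corpus far better than the SMS-style text. That gap is the dependence on a domain corpus that the paper sets out to remove.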