Domain corpus independent vocabulary generation for embedded continuous speech recognition

Bibliographic Details
Published in: IEEE Transactions on Consumer Electronics, Vol. 55, no. 3, pp. 1631-1636
Main Authors: Lim, Minkyu; Kim, Kwang-Ho; Kim, Ji-Hwan
Format: Journal Article
Language: English
Published: New York, IEEE, 01.08.2009
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Summary: This paper proposes a domain-corpus-independent vocabulary generation algorithm to improve vocabulary coverage for embedded continuous speech recognition (CSR). A CSR vocabulary is normally derived from a word frequency list, so its coverage depends on the domain corpus. We present an improved method of vocabulary generation that uses a part-of-speech (POS) tagged corpus and a knowledge base. We investigate the 152 POS tags defined in a POS-tagged corpus, together with its word-POS tag pairs. We analyze all words paired with 101 of the 152 POS tags and decide on a set of words that must be included in vocabularies of any size. The other 51 POS tags fall mainly into noun-related, named entity (NE)-related, and verb-related categories. For these noun-, verb-, and NE-related POS tags, we introduce a domain-corpus-independent word inclusion method that uses the knowledge base. For noun-related POS tags, we generate synonym groups and analyze their relative importance using Google search. We categorize verbs by lemma and take the relative importance of each lemma from pre-analyzed verb statistics. We determine the inclusion order of NEs through Google search. The proposed method shows at least 28.6% relative improvement in coverage on an SMS text corpus for vocabulary sizes of 5 K, 10 K, 15 K, and 20 K. In particular, the coverage of the 15 K vocabulary generated by the proposed method reaches 97.8%, a relative improvement of 44.2%.
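
The coverage figures in the summary can be illustrated with a minimal sketch. It assumes that coverage means the fraction of running tokens in a test corpus that appear in the vocabulary, and that "relative improvement" means the relative reduction in the out-of-vocabulary (OOV) rate; the record does not state either definition, so both are assumptions. The function names and the frequency-list baseline below are hypothetical and are not taken from the paper.

```python
from collections import Counter


def coverage(vocabulary, tokens):
    """Fraction of running tokens that appear in the vocabulary (assumed definition)."""
    if not tokens:
        raise ValueError("corpus has no tokens")
    vocab = set(vocabulary)
    covered = sum(1 for token in tokens if token in vocab)
    return covered / len(tokens)


def relative_oov_reduction(baseline_coverage, proposed_coverage):
    """Relative reduction in OOV rate; an assumed reading of 'relative improvement'."""
    baseline_oov = 1.0 - baseline_coverage
    proposed_oov = 1.0 - proposed_coverage
    if baseline_oov == 0.0:
        raise ValueError("baseline already covers every token")
    return (baseline_oov - proposed_oov) / baseline_oov


def frequency_vocabulary(tokens, size):
    """Baseline: the `size` most frequent words of a domain corpus."""
    return [word for word, _ in Counter(tokens).most_common(size)]


if __name__ == "__main__":
    # Toy data, purely illustrative; not from the paper's SMS corpus.
    test_tokens = "see you at the cafe at noon".split()
    baseline = ["see", "you", "at", "the"]
    proposed = baseline + ["noon"]
    base_cov = coverage(baseline, test_tokens)
    prop_cov = coverage(proposed, test_tokens)
    print(base_cov, prop_cov, relative_oov_reduction(base_cov, prop_cov))
```

Under these assumed definitions, a frequency-list vocabulary depends on whichever corpus `frequency_vocabulary` is given, which is the dependence the proposed method aims to remove.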
ISSN: 0098-3063, 1558-4127
DOI: 10.1109/TCE.2009.5278036