Domain corpus independent vocabulary generation for embedded continuous speech recognition
This paper proposes a domain corpus independent vocabulary generation algorithm in order to improve the coverage of vocabulary for embedded continuous speech recognition (CSR). A vocabulary in CSR is normally derived from a word frequency list. Therefore, the vocabulary coverage is dependent on a do...
Saved in:
Published in | IEEE transactions on consumer electronics Vol. 55; no. 3; pp. 1631 - 1636 |
---|---|
Main Authors | , , |
Format | Journal Article |
Language | English |
Published |
New York
IEEE
01.08.2009
The Institute of Electrical and Electronics Engineers, Inc. (IEEE) |
Subjects | |
Online Access | Get full text |
Cover
Loading…
Summary: | This paper proposes a domain corpus independent vocabulary generation algorithm in order to improve the coverage of vocabulary for embedded continuous speech recognition (CSR). A vocabulary in CSR is normally derived from a word frequency list. Therefore, the vocabulary coverage is dependent on a domain corpus. We present an improved way of vocabulary generation using part-of-speech (POS) tagged corpus and knowledge base. We investigate 152 POS tags defined in a POS tagged corpus and word-POS tag pairs. We analyze all words paired with 101 among 152 POS tags and decide on a set of words which have to be included in vocabularies of any size. The other 51 POS tags are mainly categorized with noun-related, named entity (NE)-related and verb-related POSs. We introduce a domain corpus independent word inclusion method for noun-, verb-, and NE-related POS tags using knowledge base. For noun-related POS tags, we generate synonym groups and analyze their relative importance using Google search. Then, we categorize verbs by lemma and analyze relative importance of each lemma from a pre-analyzed statistic for verbs. We determine the inclusion order of NEs through Google search. The proposed method shows at least 28.6% relative improvement of coverage for a SMS text corpus when the sizes of vocabulary are 5 K, 10 K, 15 K and 20 K. In particular, the coverage of 15 K size vocabulary generated by the proposed method reaches up to 97.8% with the relative improvement of 44.2%. |
---|---|
Bibliography: | ObjectType-Article-2 SourceType-Scholarly Journals-1 ObjectType-Feature-1 content type line 23 |
ISSN: | 0098-3063 1558-4127 |
DOI: | 10.1109/TCE.2009.5278036 |