Query-by-example Spoken Term Detection For OOV terms

The goal of spoken term detection (STD) technology is to allow open vocabulary search over large collections of speech content. In this paper, we address cases where search term(s) of interest (queries) are acoustic examples. This is provided either by identifying a region of interest in a speech st...

Full description

Saved in:
Bibliographic Details
Published in2009 IEEE Workshop on Automatic Speech Recognition & Understanding pp. 404 - 409
Main Authors Parada, C., Sethy, A., Ramabhadran, B.
Format Conference Proceeding
LanguageEnglish
Published IEEE 01.12.2009
Subjects
Online AccessGet full text

Cover

Loading…
More Information
Summary:The goal of spoken term detection (STD) technology is to allow open vocabulary search over large collections of speech content. In this paper, we address cases where search term(s) of interest (queries) are acoustic examples. This is provided either by identifying a region of interest in a speech stream or by speaking the query term. Queries often relate to named-entities and foreign words, which typically have poor coverage in the vocabulary of large vocabulary continuous speech recognition (LVCSR) systems. Throughout this paper, we focus on query-by-example search for such out-of-vocabulary (OOV) query terms. We build upon a finite state transducer (FST) based search and indexing system to address the query by example search for OOV terms by representing both the query and the index as phonetic lattices from the output of an LVCSR system. We provide results comparing different representations and generation mechanisms for both queries and indexes built with word and combined word and subword units. We also present a two-pass method which uses query-by-example search using the best hit identified in an initial pass to augment the STD search results. The results demonstrate that query-by-example search can yield a significantly better performance, measured using actual term-weighted value (ATWV), of 0.479 when compared to a baseline ATWV of 0.325 that uses reference pronunciations for OOVs. Further improvements can be obtained with the proposed two pass approach and filtering using the expected unigram counts from the LVCSR system's lexicon.
ISBN:1424454786
9781424454785
DOI:10.1109/ASRU.2009.5373341