Fast Query-by-Example Speech Search Using Attention-Based Deep Binary Embeddings

Bibliographic Details
Published in: IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 28, pp. 1988-2000
Main Authors: Yuan, Yougen; Xie, Lei; Leung, Cheung-Chi; Chen, Hongjie; Ma, Bin
Format: Journal Article
Language: English
Published: Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2020

Summary: State-of-the-art query-by-example (QbE) speech search approaches usually use recurrent neural network (RNN) based acoustic word embeddings (AWEs) to represent variable-length speech segments with fixed-dimensional vectors, so that simple cosine distances can be measured between the embedded vectors of the spoken query and the search content. In this paper, we aim to improve search accuracy and speed for the AWE-based QbE approach in a low-resource scenario. First, a multi-head self-attentive mechanism is introduced to learn a sequence of attention weights over all time steps of the RNN outputs while attending to different positions of a speech segment. Second, since similarity measurement over real-valued AWEs is computationally expensive, a hashing layer is adopted to learn deep binary embeddings, so that binary pattern matching can be used directly for fast QbE speech search. The proposed self-attentive deep hashing network is trained with three specifically designed objectives: a penalization term, a triplet loss, and a quantization loss. Experiments show that our approach improves search speed by a factor of 8 and mean average precision (MAP) by a relative 18.9%, compared with the previous best real-valued embedding approach.
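The abstract names the model components but gives no implementation details. The following PyTorch-style sketch (not the authors' code) illustrates how the pieces it describes could fit together: a bidirectional GRU encoder, multi-head self-attentive pooling over the RNN outputs in the style of Lin et al.'s structured self-attentive embeddings, a tanh hashing layer, and the three training objectives. All layer sizes, hyperparameters, loss weights, and names are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentiveHashingNet(nn.Module):
    # Illustrative encoder: BiGRU -> multi-head self-attentive pooling -> tanh hashing layer.
    def __init__(self, feat_dim=39, hidden=256, heads=4, bits=512):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.att = nn.Sequential(
            nn.Linear(2 * hidden, 128), nn.Tanh(), nn.Linear(128, heads))
        self.hash = nn.Linear(2 * hidden * heads, bits)

    def forward(self, x):                        # x: (B, T, feat_dim) acoustic features
        h, _ = self.rnn(x)                       # (B, T, 2*hidden) RNN outputs
        A = F.softmax(self.att(h), dim=1)        # (B, T, heads): one weight per time step per head
        m = torch.einsum('bth,btd->bhd', A, h)   # per-head weighted sums over time
        e = torch.tanh(self.hash(m.flatten(1)))  # real-valued embedding squashed toward {-1, +1}
        return e, A

def penalization(A):
    # Push the attention heads toward different positions: ||A^T A - I||_F^2.
    gram = torch.einsum('bth,btk->bhk', A, A)
    eye = torch.eye(A.size(-1), device=A.device)
    return ((gram - eye) ** 2).sum(dim=(1, 2)).mean()

def quantization(e):
    # Penalize the gap between the embedding and its binarized version sign(e).
    return ((e - e.sign()) ** 2).sum(dim=1).mean()

def total_loss(anc, pos, neg, A, margin=0.4, lam=1e-3, mu=1e-2):
    # Triplet loss (Euclidean here; the abstract does not specify the exact distance).
    trip = F.triplet_margin_loss(anc, pos, neg, margin=margin)
    return trip + lam * penalization(A) + mu * quantization(anc)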
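At search time, the real-valued outputs are binarized and compared with Hamming distance, which is where the reported speedup over cosine distance on real-valued vectors comes from. A minimal NumPy sketch of that matching step, assuming the embedding dimension is a multiple of 8 (function names are hypothetical):

import numpy as np

def binarize(embeddings):
    # Map real-valued embeddings to packed uint8 bit patterns via sign thresholding.
    return np.packbits(embeddings > 0, axis=1)

def hamming_search(query_code, db_codes, top_k=10):
    # XOR the packed codes, then count differing bits (popcount) per row.
    xor = np.bitwise_xor(db_codes, query_code)
    dists = np.unpackbits(xor, axis=1).sum(axis=1)
    order = np.argsort(dists)[:top_k]
    return order, dists[order]

# Example: rank 10,000 indexed segments against one query.
# db = binarize(np.random.randn(10000, 512)); q = binarize(np.random.randn(1, 512))
# idx, d = hamming_search(q, db)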
ISSN: 2329-9290, 2329-9304
DOI: 10.1109/TASLP.2020.2998277