Accurate and Practical Query-by-Example Using Multiple Deep Learning Models and Frame Compression Methods

Bibliographic Details
Published in: 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp. 862-867
Main Authors: Yamaga, Hikaru; Hatakeyama, Kazuki; Kojima, Kazunori; Lee, Shi-Wook; Itoh, Yoshiaki
Format: Conference Proceeding
Language: English
Published: IEEE, 31.10.2023
Summary: Spoken term detection (STD) and spoken-query STD (SQ-STD), also known as query-by-example (QbE), have been studied actively in recent years. A representative QbE method is posteriorgram matching using the outputs of deep neural networks, but it requires long retrieval times and a large amount of memory. To address this difficulty, we previously proposed the maximum likelihood state sequence (MLSS) method to reduce retrieval time. This paper proposes two methods, "blank-cut (b-cut)" and "frame de-duplication (FDD)", that compress posteriorgram frames and thereby reduce retrieval time and memory size. In the proposed methods, multiple matching scores are obtained using multiple deep learning models and architectures and are then integrated. In evaluation experiments using two open test sets of about 30 hours of speech data, we achieved state-of-the-art retrieval accuracy. Furthermore, the proposed method achieved a retrieval time of less than 1 s and a memory requirement of about 1 GB. These results demonstrate the effectiveness of the proposed method.
ISSN: 2640-0103
DOI: 10.1109/APSIPAASC58517.2023.10317220
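
The summary names two frame compression steps, b-cut and FDD, but this record does not define them. The sketch below is one plausible reading, not the authors' implementation. It assumes that b-cut drops frames dominated by a CTC-style blank class and that FDD keeps only the first frame of each run of consecutive frames sharing the same top-scoring class. The function names, the blank index, and the 0.9 threshold are all hypothetical.

```python
from typing import List, Optional, Sequence

# One row of class posteriors per frame (hypothetical layout: index 0 is "blank").
Posteriorgram = List[Sequence[float]]


def argmax(row: Sequence[float]) -> int:
    """Return the index of the largest posterior in one frame."""
    return max(range(len(row)), key=lambda i: row[i])


def blank_cut(frames: Posteriorgram, blank: int = 0,
              threshold: float = 0.9) -> Posteriorgram:
    """Assumed b-cut: drop frames whose blank posterior reaches `threshold`."""
    return [row for row in frames if row[blank] < threshold]


def frame_dedup(frames: Posteriorgram) -> Posteriorgram:
    """Assumed FDD: keep the first frame of each run with the same top class."""
    kept: Posteriorgram = []
    prev_top: Optional[int] = None
    for row in frames:
        top = argmax(row)
        if top != prev_top:
            kept.append(row)
        prev_top = top
    return kept


# Toy posteriorgram with three classes (blank, class 1, class 2).
posteriorgram = [
    [0.95, 0.03, 0.02],  # blank-dominated frame
    [0.10, 0.80, 0.10],  # class 1
    [0.05, 0.85, 0.10],  # class 1 again (duplicate top class)
    [0.97, 0.02, 0.01],  # blank-dominated frame
    [0.10, 0.10, 0.80],  # class 2
    [0.20, 0.10, 0.70],  # class 2 again (duplicate top class)
]

compressed = frame_dedup(blank_cut(posteriorgram))
print(len(posteriorgram), "->", len(compressed))  # prints "6 -> 2"
```

Under these assumptions, fewer frames remain to be matched against the query, which is the kind of saving in retrieval time and memory that the summary describes. The actual compression criteria in the paper may differ.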