Accurate and Practical Query-by-Example Using Multiple Deep Learning Models and Frame Compression Methods
Published in | 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp. 862-867 |
---|---|
Format | Conference Proceeding |
Language | English |
Published | IEEE, 31.10.2023 |
Summary: | Spoken term detection (STD) and spoken-query STD (SQ-STD), also known as query-by-example (QbE), have recently been studied actively. A representative QbE method is posteriorgram matching using the outputs of deep neural networks. However, that method requires long retrieval times and a large amount of memory. To address this difficulty, we proposed a maximum likelihood state sequence (MLSS) method to reduce retrieval time. This paper proposes two methods, "blank-cut (b-cut)" and "frame de-duplication (FDD)", that compress posteriorgram frames and thereby reduce retrieval time and memory size. In the proposed methods, multiple matching scores are obtained using multiple deep learning models and architectures and are then integrated. Evaluation experiments using two open test sets of about 30 hr of speech data showed state-of-the-art retrieval accuracy. Furthermore, the proposed method achieved a retrieval time of less than 1 s and a memory requirement of about 1 GB. These results demonstrate the effectiveness of the proposed method. |
---|---|
ISSN: | 2640-0103 |
DOI: | 10.1109/APSIPAASC58517.2023.10317220 |
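
The summary names two frame compression steps, b-cut and FDD, but does not define them. The sketch below is one plausible reading, not the authors' implementation. It assumes that b-cut drops frames dominated by a CTC-style blank label and that FDD keeps only the first frame of each run of consecutive frames sharing the same top-scoring label. The function names, the `blank_index` parameter, and the `threshold` value are hypothetical.

```python
from typing import List, Sequence

Frame = Sequence[float]  # one posterior distribution over labels


def blank_cut(frames: List[Frame], blank_index: int = 0,
              threshold: float = 0.9) -> List[Frame]:
    """Assumed reading of b-cut: drop frames whose blank posterior
    exceeds `threshold` (both parameters are hypothetical)."""
    return [f for f in frames if f[blank_index] <= threshold]


def frame_dedup(frames: List[Frame]) -> List[Frame]:
    """Assumed reading of FDD: keep only the first frame of each run of
    consecutive frames that share the same highest-scoring label."""
    kept: List[Frame] = []
    prev_label = None
    for f in frames:
        label = max(range(len(f)), key=f.__getitem__)  # argmax over labels
        if label != prev_label:
            kept.append(f)
            prev_label = label
    return kept


def compress(frames: List[Frame], blank_index: int = 0,
             threshold: float = 0.9) -> List[Frame]:
    """Apply the assumed b-cut step, then the assumed FDD step."""
    return frame_dedup(blank_cut(frames, blank_index, threshold))


# Toy posteriorgram: label 0 plays the role of the blank.
posteriorgram = [
    [0.95, 0.03, 0.02],  # blank-dominated frame
    [0.10, 0.80, 0.10],  # label 1
    [0.05, 0.90, 0.05],  # label 1 again (duplicate run)
    [0.97, 0.02, 0.01],  # blank-dominated frame
    [0.10, 0.10, 0.80],  # label 2
]
compressed = compress(posteriorgram)
```

Under these assumptions the five toy frames shrink to two, which illustrates why fewer frames would lower both matching time and memory; the actual compression ratios and criteria are those reported in the paper, not this sketch.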