Detection of Arbitrary Wake Words by Coupling a Phoneme Predictor and a Phoneme Sequence Detector

Bibliographic Details
Published in: APSIPA Transactions on Signal and Information Processing, Vol. 13, no. 1
Main Authors: Nishimura, Ryota; Uno, Takaaki; Yamamoto, Taiki; Ohta, Kengo; Kitaoka, Norihide
Format: Journal Article
Language: English
Published: Boston; Delft: Now Publishers Inc, 01.01.2024
ISSN: 2048-7703
DOI: 10.1561/116.20240014

Summary: Most wake word (WW) detection systems used in smartphones and smart speakers detect only specific, predefined WWs such as “Hey, Siri” or “OK, Google”. Building such a system requires collecting a large speech corpus containing many examples of the selected WWs to train the model. Detecting a different WW requires collecting a new speech corpus and retraining the model. In this study, we propose a system that can detect any chosen WW without additional model training or a corpus of WW utterances, allowing users to select and use their preferred WW. Our system consists of a phoneme predictor (PP) and a phoneme sequence detector (PSD). The PP takes acoustic features of the input speech and outputs phoneme probability distributions. The acoustic models in the PP are trained using the Connectionist Temporal Classification (CTC) loss criterion. The PSD takes the output of the PP as input and predicts the probability that the WW has been spoken. In our evaluation experiments on six-phoneme WW detection, the proposed method achieved 90% WW detection accuracy.
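To make the PP-to-PSD hand-off concrete, the following is a minimal sketch of how frame-level phoneme probabilities could be scored against a chosen phoneme sequence. It is not the authors' PSD, which the summary describes as a model that predicts the WW probability; here the standard CTC forward algorithm is swapped in as a plain, non-learned scorer. The function names `ctc_log_prob` and `detect`, the phoneme indexing, the blank index, and the threshold are all assumptions made for illustration.

```python
import math

NEG_INF = float("-inf")


def log_add(a, b):
    """Return log(exp(a) + exp(b)) in a numerically stable way."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    hi, lo = (a, b) if a > b else (b, a)
    return hi + math.log1p(math.exp(lo - hi))


def ctc_log_prob(log_probs, target, blank=0):
    """CTC forward algorithm: log P(target | frames).

    log_probs: list of T frames; each frame is a list of log-probabilities
               over the phoneme inventory (assumed PP output format).
    target:    phoneme indices of the chosen wake word, without blanks.
    blank:     index of the CTC blank symbol (assumed to be 0 here).
    """
    # Interleave blanks around the target: [b, p1, b, p2, ..., pL, b].
    ext = [blank]
    for p in target:
        ext += [p, blank]
    n = len(ext)

    # Initialise with the first frame: start in the leading blank or p1.
    alpha = [NEG_INF] * n
    alpha[0] = log_probs[0][ext[0]]
    if n > 1:
        alpha[1] = log_probs[0][ext[1]]

    for frame in log_probs[1:]:
        new = [NEG_INF] * n
        for s in range(n):
            total = alpha[s]                       # stay on the same symbol
            if s >= 1:
                total = log_add(total, alpha[s - 1])  # advance by one
            # Skip a blank only between two different phonemes.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total = log_add(total, alpha[s - 2])
            if total != NEG_INF:
                new[s] = total + frame[ext[s]]
        alpha = new

    # Valid endings: the last phoneme or the trailing blank.
    result = alpha[n - 1]
    if n > 1:
        result = log_add(result, alpha[n - 2])
    return result


def detect(log_probs, target, threshold):
    """Hypothetical decision rule: per-frame log-score against a threshold."""
    return ctc_log_prob(log_probs, target) / len(log_probs) >= threshold


if __name__ == "__main__":
    # Toy posteriors over {0: blank, 1: "a", 2: "b"} for two frames.
    probs = [[0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
    frames = [[math.log(p) for p in row] for row in probs]
    print(math.exp(ctc_log_prob(frames, [1, 2])))  # sequence "a b"
    print(math.exp(ctc_log_prob(frames, [2, 1])))  # sequence "b a"
```

On the toy frames, the sequence "a b" scores about 0.64 and "b a" about 0.01, so changing the wake word only means changing `target`; no retraining step appears in this sketch. A real threshold would have to be tuned on held-out data.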
Bibliography: SIP-20240014
Keywords: CTC; wake word; end-to-end modeling; phoneme sequence detector