Detection of Arbitrary Wake Words by Coupling a Phoneme Predictor and a Phoneme Sequence Detector

Bibliographic Details
Published in	APSIPA transactions on signal and information processing Vol. 13; no. 1
Main Authors	Nishimura, Ryota; Uno, Takaaki; Yamamoto, Taiki; Ohta, Kengo; Kitaoka, Norihide
Format	Journal Article
Language	English
Published	Boston; Delft: Now Publishers Inc, 01.01.2024
Subjects	Adaptive signal processing; Communications and Information Theory; Engineering; Phonemes; Signal Processing; Signal processing for communications; Speech; Speech and spoken language processing; Speech/audio/image/video compression; Statistical signal processing; Technology
ISSN	2048-7703
DOI	10.1561/116.20240014

More Information
Summary:	Most wake word (WW) detection systems used in smartphones and smart speakers detect only specific, predefined WWs such as “Hey, Siri” or “OK, Google”. To build such a system, a large speech corpus containing many examples of the selected WWs must be collected to train the model. Detecting a different WW requires collecting a new speech corpus and re-training the model. In this study, we propose a system that can detect any chosen WW without additional model training or a corpus of WW utterances, allowing users to select and use their preferred WW. Our system consists of a phoneme predictor (PP) and a phoneme sequence detector (PSD). The PP predicts phoneme sequences from acoustic features of the input speech and outputs phoneme probability distributions. The acoustic models in the PP are trained using the Connectionist Temporal Classification (CTC) loss criterion. The PSD takes the output of the PP as input and estimates the probability that the WW is present in the input. In our evaluation experiments, we performed six-phoneme WW detection. Our results showed that the proposed method achieved 90% WW detection accuracy.
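The coupling described in the summary can be illustrated with a minimal sketch. This record does not describe the internals of the authors' PSD, so the code below swaps in a different, standard technique as a stand-in: the CTC forward algorithm, which scores how likely a chosen phoneme sequence is given frame-level phoneme probabilities such as those a CTC-trained PP would output. The function names, the blank index, and the threshold value are all assumptions made for illustration.

```python
BLANK = 0  # assumed index of the CTC blank symbol in each posterior row


def ctc_sequence_probability(posteriors, target):
    """Probability that the frame posteriors emit `target` under CTC.

    posteriors: list of T frames, each a list of per-symbol probabilities.
    target: list of phoneme indices for the chosen wake word (no blanks).
    """
    # Interleave blanks around the phonemes: [blank, p1, blank, p2, ..., blank]
    ext = [BLANK]
    for phoneme in target:
        ext += [phoneme, BLANK]
    size = len(ext)

    # alpha[s]: total probability of alignments that end in ext[s] so far
    alpha = [0.0] * size
    alpha[0] = posteriors[0][ext[0]]
    if size > 1:
        alpha[1] = posteriors[0][ext[1]]

    for frame in posteriors[1:]:
        new_alpha = [0.0] * size
        for s in range(size):
            total = alpha[s]              # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]     # advance by one symbol
            # Skip the blank only between two different phonemes
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new_alpha[s] = total * frame[ext[s]]
        alpha = new_alpha

    # Valid alignments end on the last phoneme or the trailing blank
    return alpha[-1] + (alpha[-2] if size > 1 else 0.0)


def wake_word_detected(posteriors, target, threshold=0.5):
    """Hypothetical decision rule: compare the sequence score to a threshold."""
    return ctc_sequence_probability(posteriors, target) >= threshold


# Toy input: 3 frames over symbols [blank, phoneme 1, phoneme 2]
frames = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
]
score = ctc_sequence_probability(frames, [1, 2])
```

Because the target sequence is only an argument, changing the wake word needs no retraining in this sketch, which mirrors the motivation stated in the summary. A practical version would work in log space to avoid underflow on long inputs and would score sliding windows of a continuous audio stream.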
Bibliography:	SIP-20240014; Now Publishers. Keywords: CTC; wake word; end-to-end modeling; phoneme sequence detector