Room-localized speech activity detection in multi-microphone smart homes

Bibliographic Details
Published in	EURASIP journal on audio, speech, and music processing Vol. 2019; no. 1; pp. 1 - 23
Main Authors	Giannoulis, Panagiotis; Potamianos, Gerasimos; Maragos, Petros
Format	Journal Article
Language	English
Published	Cham: Springer International Publishing, 27.08.2019; Springer Nature B.V; SpringerOpen
Subjects	Acoustics; Active room selection; Algorithms; Computer simulation; Engineering; Engineering Acoustics; Machine learning; Mathematics in Music; Microphone arrays; Microphones; Multi-channel fusion; Research projects; Segmentation; Signal, Image and Speech Processing; Smart buildings; Smart homes; Smart houses; Speech activity detection; Speech recognition; Statistical models; Subsystems; Voice activity detectors
ISSN	1687-4722, 1687-4714
DOI	10.1186/s13636-019-0158-8

Summary:	Voice-enabled interaction systems in domestic environments have attracted significant interest recently, being the focus of smart home research projects and commercial voice assistant home devices. Within the multi-module pipelines of such systems, speech activity detection (SAD) constitutes a crucial component, providing input to their activation and speech recognition subsystems. In typical multi-room domestic environments, SAD may also convey spatial intelligence to the interaction, in addition to its traditional temporal segmentation output, by assigning speech activity at the room level. Such room-localized SAD can, for example, disambiguate user command referents, allow localized system feedback, and enable parallel voice interaction sessions by multiple subjects in different rooms. In this paper, we investigate a room-localized SAD system for smart homes equipped with multiple microphones distributed in multiple rooms, significantly extending our earlier work. The system employs a two-stage algorithm, incorporating a set of hand-crafted features specially designed to discriminate room-inside vs. room-outside speech at its second stage, refining SAD hypotheses obtained at its first stage by traditional statistical modeling and acoustic front-end processing. Both algorithmic stages exploit multi-microphone information, combining it at the signal, feature, or decision level. The proposed approach is extensively evaluated on both simulated and real data recorded in a multi-room, multi-microphone smart home, significantly outperforming alternative baselines. Further, it remains robust to reduced microphone setups, while also comparing favorably to deep learning-based alternatives.