Audio-Visual Based Online Multi-Source Separation

Bibliographic Details
Published in: IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 30, pp. 1219–1234
Main Authors: Ong, Jonah; Vo, Ba Tuong; Nordholm, Sven; Vo, Ba-Ngu; Moratuwage, Diluka; Shim, Changbeom
Format: Journal Article
Language: English
Published: Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2022

Summary: Meeting or conference assistance is a popular application that typically requires compact configurations of co-located audio and visual sensors. This paper proposes a novel solution for online separation of an unknown and time-varying number of moving sources using only a single microphone array co-located with a single visual device. The approach exploits the complementary nature of simultaneous audio and visual measurements through a model-centric 3-stage process of detection, tracking, and (spatial) filtering, which performs separation in a block-wise or recursive fashion. Fusing the measurements requires solving the multi-modal space-time permutation problem: the audio and visual measurements not only reside in different observation spaces, but are also unidentified or unlabeled (with respect to the unknown and time-varying number of sources) and are subject to noise, extraneous measurements, and missing measurements. A labeled random finite set tracking filter is applied to resolve the permutation problem and to recursively estimate the source identities and trajectories. A time-varying set of generalized side-lobe cancellers is then constructed from the tracking estimates to perform online separation. Evaluations are undertaken with live human speakers.
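To make the (spatial) filtering stage concrete, the following is a minimal Python/NumPy sketch of a per-frequency generalized side-lobe canceller for a single tracked source. It is an illustration under assumed simplifications, not the paper's implementation: the steering vector is taken as given (in the paper it would be derived from the tracker's position estimate for a labeled source), the blocking matrix is a generic orthogonal-complement construction, the adaptation rule is a standard NLMS update, and the function name gsc_separate and all parameter values are hypothetical.

    import numpy as np

    def gsc_separate(X, steering, mu=0.1, eps=1e-8):
        """Hypothetical per-frequency GSC for one tracked source.

        X        : (frames, mics) complex STFT coefficients at one frequency bin
        steering : (mics,) complex steering vector toward the tracked source
        Returns  : (frames,) complex separated output for this bin
        """
        d = steering / np.linalg.norm(steering)   # unit-norm steering vector
        # Blocking matrix: orthonormal basis of the subspace orthogonal to d,
        # so its outputs ideally carry interference/noise but no target signal.
        U, _, _ = np.linalg.svd(d[:, None])       # full SVD; U[:, 0] is parallel to d
        B = U[:, 1:]
        wa = np.zeros(B.shape[1], dtype=complex)  # adaptive canceller weights
        y = np.empty(X.shape[0], dtype=complex)
        for k, x in enumerate(X):
            d_out = d.conj() @ x                  # fixed (distortionless) beamformer
            u = B.conj().T @ x                    # blocked, target-free reference signals
            y[k] = d_out - wa.conj() @ u          # subtract estimated interference
            # NLMS update driving the adaptive branch to minimize output power |y|^2
            wa += mu * u * np.conj(y[k]) / (np.real(u.conj() @ u) + eps)
        return y

In the pipeline described in the summary, one such canceller would presumably be instantiated per confirmed track and retired when the track terminates, with its steering vector refreshed as the labeled random finite set filter updates that source's trajectory estimate.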
ISSN: 2329-9290, 2329-9304
DOI: 10.1109/TASLP.2022.3156758