Context and Uncertainty Modeling for Online Speaker Change Detection

Bibliographic Details
Published in	ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8379-8383
Main Authors	Aronowitz, Hagai; Zhu, Weizhong
Format	Conference Proceeding
Language	English
Published	IEEE 01.05.2020
Subjects	affinity matrix; Context modeling; duration modeling; Neural networks; Online speaker change detection; Signal processing; Speech processing; Task analysis; Uncertainty; uncertainty modeling in deep learning

More Information
Summary:	Speaker change detection is often addressed as a key component of speaker diarization systems. In this work, we focus on online speaker change detection as a standalone task, which is required for online closed captioning of broadcast television. Contrary to related works, we do not operate on frame-level features such as MFCC. Instead, we leverage state-of-the-art speaker recognition-based technology by modeling sequences of pretrained speaker embeddings (x-vectors) using a deep neural network. We explicitly address two types of uncertainty. The first is uncertainty in the embedding point estimate, which is due to short and varying segment durations. The second is uncertainty about which context segments are relevant to representing the speaker talking right before the hypothesized speaker change. We also show the robustness of the affinity-matrix representation for speaker change detection. Our methods provide significant accuracy improvements over several baselines, including a recently published end-to-end system.
ISSN:	2379-190X
DOI:	10.1109/ICASSP40776.2020.9053280
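
Illustrative note: the summary refers to an affinity-matrix representation built over a sequence of speaker embeddings. The sketch below shows one generic way such a matrix could be formed for a short window of segments around a hypothesized change point. It is not the authors' method: cosine similarity, the toy 2-D vectors standing in for x-vectors, and the helper names `affinity_matrix` and `cross_boundary_affinity` are all assumptions made for illustration. In the paper, a deep neural network models the embedding sequence; here a simple average stands in for that model.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def affinity_matrix(embeddings: Sequence[Vector]) -> List[List[float]]:
    """Pairwise cosine affinities between all segment embeddings in a window."""
    return [[cosine(u, v) for v in embeddings] for u in embeddings]


def cross_boundary_affinity(aff: List[List[float]], boundary: int) -> float:
    """Mean affinity between segments before and after the boundary index.

    A low value suggests the two sides were spoken by different speakers.
    This averaging rule is a hypothetical stand-in for a learned model.
    """
    n = len(aff)
    if not 0 < boundary < n:
        raise ValueError("boundary must split the window into two non-empty sides")
    pairs = [aff[i][j] for i in range(boundary) for j in range(boundary, n)]
    return sum(pairs) / len(pairs)


# Toy window: two segments from one "speaker", then two from another.
# Real x-vectors have hundreds of dimensions; 2-D is used only for readability.
window = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aff = affinity_matrix(window)
score = cross_boundary_affinity(aff, boundary=2)
```

In this toy window the within-speaker affinities are high and the cross-boundary average is low, which is the pattern a change detector would look for. The summary's two uncertainty types (unreliable embeddings from short segments, and unknown relevance of context segments) are not modeled in this sketch.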