Context and Uncertainty Modeling for Online Speaker Change Detection

Bibliographic Details
Published in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8379-8383
Main Authors Aronowitz, Hagai; Zhu, Weizhong
Format Conference Proceeding
Language English
Published IEEE 01.05.2020
Summary: Speaker change detection is often addressed as a key component of speaker diarization systems. In this work, we focus on online speaker change detection as a standalone task, which is required for online closed captioning of broadcast television. Contrary to related work, we do not operate on frame-level features such as MFCCs. Instead, we leverage state-of-the-art speaker recognition technology by modeling sequences of pretrained speaker embeddings (x-vectors) with a deep neural network. We explicitly address two types of uncertainty. The first is uncertainty in the embedding point estimate, which is due to short and varying segment durations. The second is uncertainty about which context segments are relevant for representing the speaker talking right before the hypothesized speaker change. We also show the robustness of an affinity-matrix representation for speaker change detection. Our methods provide very significant accuracy improvements over several baselines, including a recently published end-to-end system.
ISSN: 2379-190X
DOI: 10.1109/ICASSP40776.2020.9053280
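
Illustrative note: the summary mentions an affinity-matrix representation over segment-level speaker embeddings. The sketch below is not the authors' method (the paper models embedding sequences with a deep neural network, and this record gives no details of it). It only shows, under simple assumptions, what a cosine affinity matrix over a few segment embeddings might look like and how a naive change score could be read from it. The function names, the toy 2-D vectors, and the scoring rule are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def affinity_matrix(embeddings):
    """Pairwise cosine affinities between segment embeddings."""
    return [[cosine(u, v) for v in embeddings] for u in embeddings]

def change_score(affinity, boundary):
    """Naive score (an assumption, not from the paper):
    1 minus the mean affinity between segments before the
    hypothesized boundary and segments from the boundary onward.
    Higher values suggest a speaker change."""
    left = range(boundary)
    right = range(boundary, len(affinity))
    cross = [affinity[i][j] for i in left for j in right]
    return 1.0 - sum(cross) / len(cross)

# Toy 2-D stand-ins for x-vectors: two context segments that point
# in nearly the same direction, followed by one dissimilar segment.
segments = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
A = affinity_matrix(segments)
score = change_score(A, boundary=2)
```

In this toy case the two context segments are close to each other and far from the last one, so the score at the boundary is high. A real system would use actual x-vectors and a learned model rather than a fixed averaging rule.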