Better Together: Dialogue Separation and Voice Activity Detection for Audio Personalization in TV

Bibliographic Details
Published in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1 - 5
Main Authors Torcoli, Matteo; Habets, Emanuel A. P.
Format Conference Proceeding
Language English
Published IEEE 04.06.2023

Summary: In TV services, dialogue-level personalization is key to meeting user preferences and needs. When dialogue and background sounds are not separately available from the production stage, Dialogue Separation (DS) can estimate them to enable personalization. DS was shown to provide clear benefits for the end user. Still, the estimated signals are not perfect, and some leakage can be introduced. This is undesired, especially during passages without dialogue. We propose to combine DS and Voice Activity Detection (VAD), both recently proposed for TV audio. When their combination suggests dialogue inactivity, background components leaking in the dialogue estimate are reassigned to the background estimate. A clear improvement of the audio quality is shown for dialogue-free signals, without performance drops when dialogue is active. A post-processed VAD estimate with improved detection accuracy is also generated. It is concluded that DS and VAD can improve each other and are better used together.
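The reassignment step described in the summary can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, frame length, and the frame-wise boolean VAD input are assumptions for illustration. Where the VAD marks a frame as dialogue-inactive, the content of the dialogue estimate (assumed to be pure background leakage there) is added back to the background estimate and zeroed in the dialogue estimate:

```python
import numpy as np

def reassign_leakage(dialogue_est, background_est, vad_frames, frame_len=1024):
    """Hypothetical sketch: move background leakage out of the dialogue
    estimate wherever the VAD indicates dialogue inactivity.

    dialogue_est, background_est : 1-D sample arrays from a DS system
    vad_frames : per-frame booleans, True = dialogue active
    """
    d = np.asarray(dialogue_est, dtype=float).copy()
    b = np.asarray(background_est, dtype=float).copy()
    for i, active in enumerate(vad_frames):
        start, stop = i * frame_len, (i + 1) * frame_len
        if not active:
            # During dialogue-free passages, whatever DS put in the
            # dialogue estimate is leakage: reassign it to the background.
            b[start:stop] += d[start:stop]
            d[start:stop] = 0.0
    return d, b
```

In practice the decision would come from combining the DS output with the VAD estimate (the paper's "better together" point), and a real system would likely apply cross-fades at frame boundaries rather than a hard reassignment.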
ISSN: 2379-190X
DOI: 10.1109/ICASSP49357.2023.10095153