Modeling Turn-Taking in Human-To-Human Spoken Dialogue Datasets Using Self-Supervised Features


Bibliographic Details
Published in: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1 - 5
Main Authors: Morais, Edmilson; Damasceno, Matheus; Aronowitz, Hagai; Satt, Aharon; Hoory, Ron
Format: Conference Proceeding
Language: English
Published: IEEE, 04.06.2023

Summary: Self-supervised pre-trained models have consistently delivered state-of-the-art results in natural language and speech processing. However, we argue that their merits for modeling Turn-Taking for spoken dialogue systems still need further investigation. In this paper we therefore introduce a modular End-to-End system based on an Upstream + Downstream architecture paradigm, which allows easy use and integration of a large variety of self-supervised features to model the specific Turn-Taking task of End-of-Turn Detection (EOTD). Several architectures to model the EOTD task using audio-only, text-only and audio+text modalities are presented, and their performance and robustness are carefully evaluated on three different human-to-human spoken dialogue datasets. The proposed model not only achieves SOTA results for EOTD, but also highlights the potential of powerful, well fine-tuned self-supervised models for a wide variety of Turn-Taking tasks.
ISSN:2379-190X
DOI:10.1109/ICASSP49357.2023.10096775
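The Upstream + Downstream paradigm described in the summary can be illustrated with a minimal sketch. This is not the authors' implementation: the encoder, feature dimensions, and pooling below are hypothetical stand-ins for a frozen self-supervised upstream model (e.g. a wav2vec 2.0-style encoder) feeding a lightweight downstream End-of-Turn Detection head.

```python
import numpy as np

rng = np.random.default_rng(0)

def upstream_features(audio_frames: np.ndarray, dim: int = 8) -> np.ndarray:
    """Hypothetical frozen upstream encoder.

    Stands in for a self-supervised speech model; here it just applies a
    random nonlinear projection to each raw frame, yielding one feature
    vector per frame (shape: [n_frames, dim]).
    """
    W = rng.standard_normal((audio_frames.shape[1], dim))
    return np.tanh(audio_frames @ W)

def eotd_probability(features: np.ndarray, w: np.ndarray, b: float) -> float:
    """Hypothetical downstream EOTD head.

    Mean-pools the upstream features over time, then applies a logistic
    classifier to produce P(end-of-turn) for the segment.
    """
    pooled = features.mean(axis=0)  # average over the time axis
    logit = pooled @ w + b
    return float(1.0 / (1.0 + np.exp(-logit)))

# Toy usage: 50 frames of 16-dimensional raw input.
frames = rng.standard_normal((50, 16))
feats = upstream_features(frames)
p = eotd_probability(feats, w=rng.standard_normal(8), b=0.0)
print(f"P(end-of-turn) = {p:.3f}")
```

Because the upstream is frozen, swapping in a different self-supervised feature extractor only changes `upstream_features`; the downstream head and its training loop stay untouched, which is the modularity the abstract emphasizes.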