Audio-Video detection of the active speaker in meetings

Bibliographic Details
Published in: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 2536-2543
Main Authors: Madrigal, Francisco; Lerasle, Frederic; Pibre, Lionel; Ferrane, Isabelle
Format: Conference Proceeding
Language: English
Published: IEEE, 10.01.2021
Summary: Meetings are a common activity that poses particular challenges for systems designed to assist them. One such challenge is active speaker detection, which can provide useful information for human interaction modeling or human-robot interaction. Active speaker detection is mostly done using speech; however, visual and contextual information can provide additional insights. In this paper we propose an active speaker detection framework that integrates audiovisual features with social information drawn from the meeting context. The visual cue is processed using a Convolutional Neural Network (CNN) that captures spatio-temporal relationships. We analyze several CNN architectures with two visual cues: raw pixels (RGB images) and motion (estimated with optical flow). Contextual reasoning is performed with an original methodology based on the gaze of all participants. We evaluate our proposal on a public state-of-the-art benchmark, the AMI corpus, and show how the addition of visual and contextual information improves active speaker detection performance.
DOI: 10.1109/ICPR48806.2021.9412681
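
The summary describes fusing a two-stream visual CNN (RGB frames and optical flow), audio features, and a gaze-based context cue. The sketch below is a minimal, hypothetical PyTorch illustration of that kind of late fusion; the layer sizes, the stacked-frame input encoding, the audio feature dimension, and the scalar gaze score (e.g., the fraction of participants looking at the candidate) are all assumptions for illustration, not the authors' actual architecture.

```python
import torch
import torch.nn as nn

class TwoStreamASD(nn.Module):
    """Hypothetical active speaker detector: RGB stream + optical-flow
    stream + audio features + a scalar gaze-context score, fused with a
    linear head into speaking / not-speaking logits."""

    def __init__(self, n_frames=5, audio_dim=40):
        super().__init__()

        def stream(in_ch):
            # Tiny 2D CNN over temporally stacked inputs (assumed encoding).
            return nn.Sequential(
                nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )

        self.rgb = stream(3 * n_frames)    # stacked RGB frames
        self.flow = stream(2 * n_frames)   # stacked (dx, dy) flow fields
        self.audio = nn.Sequential(nn.Linear(audio_dim, 64), nn.ReLU())
        # 64 (rgb) + 64 (flow) + 64 (audio) + 1 (gaze context score)
        self.head = nn.Linear(64 + 64 + 64 + 1, 2)

    def forward(self, rgb, flow, audio, gaze):
        z = torch.cat([self.rgb(rgb), self.flow(flow),
                       self.audio(audio), gaze.unsqueeze(1)], dim=1)
        return self.head(z)

# Usage on dummy tensors: batch of 4 candidates, 5-frame 64x64 windows.
model = TwoStreamASD()
logits = model(torch.randn(4, 15, 64, 64),  # RGB stack (3 ch x 5 frames)
               torch.randn(4, 10, 64, 64),  # flow stack (2 ch x 5 frames)
               torch.randn(4, 40),          # audio features (e.g., MFCCs)
               torch.rand(4))               # gaze-based context score
print(logits.shape)  # torch.Size([4, 2])
```

Concatenating the gaze score as an extra feature is only one plausible way to inject the social context the paper mentions; the published method may combine the cues differently.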