Audio-Video detection of the active speaker in meetings

Bibliographic Details
Published in: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 2536-2543
Main Authors: Madrigal, Francisco; Lerasle, Frederic; Pibre, Lionel; Ferrane, Isabelle
Format: Conference Proceeding
Language: English
Published: IEEE, 10.01.2021
Summary: Meetings are a common activity that poses particular challenges for systems designed to assist them. One such challenge is active speaker detection, which can provide useful information for human interaction modeling or human-robot interaction. Active speaker detection is mostly done using speech; however, visual and contextual information can provide additional insights. In this paper we propose an active speaker detection framework that integrates audiovisual features with social information drawn from the meeting context. The visual cue is processed using a Convolutional Neural Network (CNN) that captures spatio-temporal relationships. We analyze several CNN architectures with two visual cues: raw pixels (RGB images) and motion (estimated with optical flow). Contextual reasoning is performed with an original methodology based on the gaze of all participants. We evaluate our proposal on a public state-of-the-art benchmark, the AMI corpus, and show how the addition of visual and contextual information improves active speaker detection performance.
DOI: 10.1109/ICPR48806.2021.9412681
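
The summary describes fusing a two-stream visual CNN (RGB frames and optical flow), audio features, and a gaze-based context cue. The sketch below is a minimal, hypothetical PyTorch illustration of that kind of late fusion; the layer sizes, the stacked-frame input encoding, the audio feature dimension, and the scalar gaze score (e.g., the fraction of participants looking at the candidate) are all assumptions for illustration, not the authors' actual architecture.

```python
import torch
import torch.nn as nn

class TwoStreamASD(nn.Module):
    """Hypothetical active speaker detector: RGB stream + optical-flow
    stream + audio features + a scalar gaze-context score, fused with a
    linear head into speaking / not-speaking logits."""

    def __init__(self, n_frames=5, audio_dim=40):
        super().__init__()

        def stream(in_ch):
            # Tiny 2D CNN over temporally stacked inputs (assumed encoding).
            return nn.Sequential(
                nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )

        self.rgb = stream(3 * n_frames)    # stacked RGB frames
        self.flow = stream(2 * n_frames)   # stacked (dx, dy) flow fields
        self.audio = nn.Sequential(nn.Linear(audio_dim, 64), nn.ReLU())
        # 64 (rgb) + 64 (flow) + 64 (audio) + 1 (gaze context score)
        self.head = nn.Linear(64 + 64 + 64 + 1, 2)

    def forward(self, rgb, flow, audio, gaze):
        z = torch.cat([self.rgb(rgb), self.flow(flow),
                       self.audio(audio), gaze.unsqueeze(1)], dim=1)
        return self.head(z)

# Usage on dummy tensors: batch of 4 candidates, 5-frame 64x64 windows.
model = TwoStreamASD()
logits = model(torch.randn(4, 15, 64, 64),  # RGB stack (3 ch x 5 frames)
               torch.randn(4, 10, 64, 64),  # flow stack (2 ch x 5 frames)
               torch.randn(4, 40),          # audio features (e.g., MFCCs)
               torch.rand(4))               # gaze-based context score
print(logits.shape)  # torch.Size([4, 2])
```

Concatenating the gaze score as an extra feature is only one plausible way to inject the social context the paper mentions; the published method may combine the cues differently.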