Audio-Video detection of the active speaker in meetings
Published in | 2020 25th International Conference on Pattern Recognition (ICPR), pp. 2536 - 2543 |
---|---|
Main Authors | , , , |
Format | Conference Proceeding |
Language | English |
Published | IEEE, 10.01.2021 |
Summary: | Meetings are a common activity that poses particular challenges for systems designed to assist them. One such challenge is active speaker detection, which can provide useful information for human interaction modeling or human-robot interaction. Active speaker detection is mostly done using speech; however, certain visual and contextual information can provide additional insights. In this paper we propose an active speaker detection framework that integrates audiovisual features with social information from the meeting context. Visual cues are processed using a Convolutional Neural Network (CNN) that captures spatio-temporal relationships. We analyze several CNN architectures with two visual cues: raw pixels (RGB images) and motion (estimated with optical flow). Contextual reasoning is done with an original methodology based on the gaze of all participants. We evaluate our proposal on a public state-of-the-art benchmark, the AMI corpus, and show that adding visual and contextual information improves active speaker detection performance. |
---|---|
DOI: | 10.1109/ICPR48806.2021.9412681 |
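The summary describes contextual reasoning based on the gaze of all participants, fused with audiovisual scores. The paper's actual method is not reproduced here; the sketch below only illustrates the general idea under a simple assumption (listeners tend to look at the active speaker, so gaze votes can be late-fused with a per-participant audiovisual score). All names (`gaze_consensus`, `fuse_scores`) and the mixing weight `alpha` are hypothetical, not taken from the paper.

```python
from collections import Counter

def gaze_consensus(gaze_targets):
    """Count, per participant, how many others are looking at them.
    gaze_targets maps each participant to their gaze target (or None).
    Assumption (not from the paper): the most-looked-at participant
    is a likely active-speaker candidate."""
    return Counter(t for t in gaze_targets.values() if t is not None)

def fuse_scores(av_scores, gaze_targets, alpha=0.7):
    """Late-fuse per-participant audiovisual scores in [0, 1] with
    normalized gaze votes. alpha is a hypothetical mixing weight."""
    votes = gaze_consensus(gaze_targets)
    total = sum(votes.values()) or 1  # avoid division by zero
    return {
        p: alpha * av_scores.get(p, 0.0)
           + (1 - alpha) * votes.get(p, 0) / total
        for p in av_scores
    }

# Example frame: A and C look at B; B looks at A; D looks at no one.
gaze = {"A": "B", "B": "A", "C": "B", "D": None}
av = {"A": 0.4, "B": 0.5, "C": 0.2, "D": 0.1}
fused = fuse_scores(av, gaze)
speaker = max(fused, key=fused.get)  # → "B"
```

Here the gaze votes tip the decision toward "B" even though the audiovisual scores alone are close, which is the kind of benefit the summary attributes to adding contextual information.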