Active Speaker Detection Using Audio, Visual, and Depth Modalities: A Survey

Bibliographic Details
Published in IEEE Access Vol. 12; pp. 96617-96634
Main Authors Nur Aisyah Mohd Robi, Siti; Atiff Zakwan Mohd Ariffin, Muhammad; Mohd Izhar, Mohd Azri; Ahmad, Norulhusna; Mad Kaidi, Hazilah
Format Journal Article
Language English
Published Piscataway IEEE 2024
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Summary: The rapid progress of multimodal signal processing in recent years has cleared the way for novel applications in human-computer interaction, surveillance, and telecommunication. Active Speaker Detection (ASD) is a critical pre-processing step for applications such as voice recognition, speaker diarization, and noise reduction. This paper comprehensively reviews ASD methods and datasets based on three modalities: audio, visual, and/or depth. ASD methods fall into two broad categories: single-modality ASD and multi-modality ASD. The review covers the most common ASD configurations: audio-based ASD (A-ASD), visual-based ASD (V-ASD), audio-visual ASD (AV-ASD), and audio-visual-depth ASD (AVD-ASD). Each is described in detail, including model-based and neural network-based approaches. Finally, challenges and future research opportunities are highlighted to support the broader use of ASD.
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2024.3426670