AVFakeNet: A unified end-to-end Dense Swin Transformer deep learning model for audio–visual deepfakes detection


Bibliographic Details
Published in: Applied Soft Computing, Vol. 136; p. 110124
Main Authors: Ilyas, Hafsa; Javed, Ali; Malik, Khalid Mahmood
Format: Journal Article
Language: English
Published: Elsevier B.V., 01.03.2023

More Information
Summary: Recent advances in machine learning and social media platforms facilitate the creation and rapid dissemination of realistic fake content (i.e., images, videos, and audio). Initially, fake content generation involved manipulating either the audio or the video stream, but more realistic deepfake content is now produced by modifying both audio and visual streams. Research on deepfake detection has mostly focused on identifying fake videos using the visual or audio modality alone. A few approaches to audio–visual deepfake detection do exist, but most are not evaluated on a multimodal dataset whose deepfake videos contain manipulations in both streams. The unified approaches that have been evaluated on audio–visual deepfake datasets report low detection accuracies and fail when faces are side-posed. Therefore, in this paper, we introduce a novel AVFakeNet framework that examines both the audio and visual modalities of a video for deepfake detection. More specifically, our unified AVFakeNet model is a novel Dense Swin Transformer Net (DST-Net) consisting of an input block, a feature extraction block, and an output block (see the sketch after this summary). The input and output blocks comprise dense layers, while the feature extraction block employs a customized Swin Transformer module. We performed extensive experimentation on five different datasets (FakeAVCeleb, Celeb-DF, ASVspoof 2019 LA, World Leaders dataset, Presidential Deepfakes dataset) comprising audio, visual, and audio–visual deepfakes, along with a cross-corpora evaluation, to demonstrate the effectiveness and generalizability of our unified framework. Experimental results highlight the effectiveness of the proposed framework in accurately detecting deepfake videos by scrutinizing both the audio and visual streams.
•We propose a unified framework, AVFakeNet, to accurately detect manipulation in the audio–visual streams of deepfake videos.
•We propose a Dense Swin Transformer Net that computes dense hierarchical feature maps to better represent the input videos.
•AVFakeNet is robust to videos with angled or side-posed faces, varied illumination conditions, and ethnicities.
•We performed extensive experimentation and cross-corpora evaluation to demonstrate the efficacy and generalizability of our model.
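The summary describes DST-Net as an input block and output block built from dense layers around a Swin-Transformer-based feature extraction block, applied to both the audio and visual streams of a video. The following PyTorch code is a minimal illustrative sketch of that layout only, not the authors' implementation: all layer sizes, the late-fusion rule, and the use of nn.TransformerEncoder as a stand-in for the paper's customized Swin Transformer module are assumptions made here for clarity.

import torch
import torch.nn as nn


class DSTBranch(nn.Module):
    """One modality branch: dense input block, transformer feature block, dense output block."""

    def __init__(self, in_dim: int, embed_dim: int = 128, num_layers: int = 2):
        super().__init__()
        # Input block: dense layers projecting raw features to the embedding size.
        self.input_block = nn.Sequential(
            nn.Linear(in_dim, embed_dim), nn.GELU(), nn.Linear(embed_dim, embed_dim)
        )
        # Feature extraction block: a plain TransformerEncoder used here as a
        # stand-in for the customized Swin Transformer module of DST-Net.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=4, batch_first=True
        )
        self.feature_block = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # Output block: dense layers producing a per-modality embedding.
        self.output_block = nn.Sequential(
            nn.Linear(embed_dim, embed_dim), nn.GELU(), nn.Linear(embed_dim, embed_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, sequence_length, in_dim), e.g. frame or spectrogram features.
        x = self.input_block(x)
        x = self.feature_block(x)
        x = x.mean(dim=1)  # pool over the sequence dimension
        return self.output_block(x)


class AVFakeNetSketch(nn.Module):
    """Unified audio-visual classifier: fuse both branch embeddings into real/fake logits."""

    def __init__(self, visual_dim: int = 512, audio_dim: int = 64):
        super().__init__()
        self.visual_branch = DSTBranch(visual_dim)
        self.audio_branch = DSTBranch(audio_dim)
        self.classifier = nn.Linear(2 * 128, 2)  # 2 classes: real / fake

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.visual_branch(visual), self.audio_branch(audio)], dim=-1)
        return self.classifier(fused)


if __name__ == "__main__":
    model = AVFakeNetSketch()
    visual = torch.randn(4, 16, 512)   # 4 clips, 16 frames of 512-d visual features
    audio = torch.randn(4, 32, 64)     # 4 clips, 32 spectrogram frames of 64 mel bins
    print(model(visual, audio).shape)  # torch.Size([4, 2])

Joint scrutiny of both streams, as the abstract emphasizes, is what the concatenation-based fusion stands for here; the actual feature dimensions, fusion strategy, and Swin configuration used in the paper should be taken from the full text.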
ISSN: 1568-4946, 1872-9681
DOI: 10.1016/j.asoc.2023.110124