AVFakeNet: A unified end-to-end Dense Swin Transformer deep learning model for audio–visual deepfakes detection
| Published in | Applied Soft Computing, Vol. 136, p. 110124 |
|---|---|
| Main Authors | , , |
| Format | Journal Article |
| Language | English |
| Published | Elsevier B.V., 01.03.2023 |
Summary: Recent advances in machine learning and social media platforms facilitate the creation and rapid dissemination of realistic fake content (i.e., images, videos, and audio). Initially, fake content generation involved manipulating either the audio or the video stream, but more realistic deepfake content is now produced by modifying both audio–visual streams. Researchers in the field of deepfake detection mostly focus on identifying fake videos by exploiting solely the visual or the audio modality. A few approaches for audio–visual deepfake detection do exist, but most are not evaluated on a multimodal dataset containing deepfake videos with manipulations in both streams. The unified approaches that have been evaluated on audio–visual deepfake datasets report low detection accuracies and fail when faces are side-posed. Therefore, in this paper, we introduce a novel AVFakeNet framework that considers both the audio and visual modalities of a video for deepfake detection. More specifically, our unified AVFakeNet model is a novel Dense Swin Transformer Net (DST-Net) consisting of an input block, a feature extraction block, and an output block. The input and output blocks comprise dense layers, while the feature extraction block employs a customized Swin Transformer module. We performed extensive experimentation on five different datasets (FakeAVCeleb, Celeb-DF, ASVSpoof-2019 LA, World Leaders dataset, Presidential Deepfakes dataset) comprising audio, visual, and audio–visual deepfakes, along with a cross-corpora evaluation, to demonstrate the effectiveness and generalizability of our unified framework. Experimental results highlight the effectiveness of the proposed framework in accurately detecting deepfake videos by scrutinizing both the audio and visual streams.
Highlights:
- Propose a unified framework, AVFakeNet, to accurately detect manipulation in the audio–visual streams of deepfake videos.
- Propose a Dense Swin Transformer Net that computes dense hierarchical feature maps to better represent the input videos.
- AVFakeNet is robust to videos with angled or side-posed faces under varied illumination conditions and ethnicities.
- Performed extensive experimentation and cross-corpora evaluation to demonstrate the efficacy and generalizability of the model.
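For orientation, the sketch below illustrates the block layout the abstract describes: per-modality branches built from a dense input block, a transformer-based feature-extraction block, and a dense output block, whose outputs are fused for a real/fake decision. It is a minimal PyTorch sketch, not the authors' implementation; the layer sizes, the concatenation-based fusion, and the use of a plain `TransformerEncoder` as a stand-in for the customized Swin module are assumptions made for illustration.

```python
# Minimal sketch of the two-branch layout described in the abstract.
# Layer sizes, fusion by concatenation, and the TransformerEncoder stand-in
# for the customized Swin module are assumptions, not the authors' design.
import torch
import torch.nn as nn


class DSTBranch(nn.Module):
    """One modality branch: dense input block, transformer feature extractor, dense output block."""

    def __init__(self, in_dim: int, embed_dim: int = 256, depth: int = 4, out_dim: int = 128):
        super().__init__()
        # Input block: dense layers projecting raw per-token features to the embedding size.
        self.input_block = nn.Sequential(
            nn.Linear(in_dim, embed_dim), nn.GELU(),
            nn.Linear(embed_dim, embed_dim), nn.GELU(),
        )
        # Feature extraction block: stand-in for the paper's customized Swin Transformer module.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=8, dim_feedforward=4 * embed_dim, batch_first=True
        )
        self.feature_block = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        # Output block: dense layers producing the branch embedding.
        self.output_block = nn.Sequential(nn.Linear(embed_dim, out_dim), nn.GELU())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, in_dim), e.g. visual patches or spectrogram frames.
        x = self.input_block(x)
        x = self.feature_block(x)
        x = x.mean(dim=1)  # pool over tokens
        return self.output_block(x)


class AVFakeNetSketch(nn.Module):
    """Joint audio-visual classifier: one branch per modality, concatenated, then a dense head."""

    def __init__(self, visual_dim: int = 768, audio_dim: int = 128):
        super().__init__()
        self.visual_branch = DSTBranch(visual_dim)
        self.audio_branch = DSTBranch(audio_dim)
        self.classifier = nn.Sequential(
            nn.Linear(2 * 128, 64), nn.GELU(),
            nn.Linear(64, 2),  # real vs. fake logits
        )

    def forward(self, visual_tokens: torch.Tensor, audio_tokens: torch.Tensor) -> torch.Tensor:
        fused = torch.cat(
            [self.visual_branch(visual_tokens), self.audio_branch(audio_tokens)], dim=-1
        )
        return self.classifier(fused)


if __name__ == "__main__":
    model = AVFakeNetSketch()
    video = torch.randn(2, 49, 768)   # e.g. 7x7 visual patches per clip
    audio = torch.randn(2, 100, 128)  # e.g. 100 spectrogram frames
    print(model(video, audio).shape)  # torch.Size([2, 2])
```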
ISSN: 1568-4946, 1872-9681
DOI: 10.1016/j.asoc.2023.110124