Recurrent Convolutional Structures for Audio Spoof and Video Deepfake Detection

Bibliographic Details
Published in	IEEE Journal of Selected Topics in Signal Processing, Vol. 14, no. 5, pp. 1024-1037
Main Authors	Chintha, Akash; Thai, Bao; Sohrawardi, Saniat Javid; Bhatt, Kartavya; Hickerson, Andrea; Wright, Matthew; Ptucha, Raymond
Format	Journal Article
Language	English
Published	New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.08.2020
Subjects	Audio data; Computer architecture; Convolution; Cost function; Datasets; Deception; deep learning; deepfake; Digital media; Entropy; Face; Feature extraction; Forgery; Information integrity; Personal computers; Representations; spoof; Videos

More Information
Summary:	Deepfakes, or artificially generated audiovisual renderings, can be used to defame a public figure or influence public opinion. With the recent discovery of generative adversarial networks, an attacker using a normal desktop computer fitted with an off-the-shelf graphics processing unit can make renditions realistic enough to easily fool a human observer. Detecting deepfakes is thus becoming important for reporters, social media platforms, and the general public. In this work, we introduce simple, yet surprisingly efficient digital forensic methods for audio spoof and visual deepfake detection. Our methods combine convolutional latent representations with bidirectional recurrent structures and entropy-based cost functions. The latent representations for both audio and video are carefully chosen to extract semantically rich information from the recordings. By feeding these into a recurrent framework, we can detect both spatial and temporal signatures of deepfake renditions. The entropy-based cost functions work well in isolation as well as in context with traditional cost functions. We demonstrate our methods on the FaceForensics++ and Celeb-DF video datasets and the ASVSpoof 2019 Logical Access audio datasets, achieving new benchmarks in all categories. We also perform extensive studies to demonstrate generalization to new domains and gain further insight into the effectiveness of the new architectures.
ISSN:	1932-4553 1941-0484
DOI:	10.1109/JSTSP.2020.2999185