Fake speech detection using VGGish with attention block
Published in | EURASIP Journal on Audio, Speech, and Music Processing, Vol. 2024, No. 1, pp. 35–19 |
---|---|
Main Authors | , , , , |
Format | Journal Article |
Language | English |
Published | Cham: Springer International Publishing (Springer Nature B.V., SpringerOpen), 26.06.2024 |
Subjects | |
Summary: | While deep learning technologies have made remarkable progress in generating deepfakes, their misuse has become a well-known concern. The ubiquitous use of deepfakes to spread false information poses significant risks to the security and privacy of individuals. The primary objective of audio spoofing detection is to identify audio generated through AI-based techniques. Several fake audio detection techniques based on machine learning already exist; however, they lack generalization and may not identify all types of AI-synthesized audio, such as replay attacks, voice conversion, and text-to-speech (TTS). In this paper, a deep layered model, VGGish, combined with an attention block, the Convolutional Block Attention Module (CBAM), is introduced for spoofing detection. The proposed model converts input audio into mel-spectrograms, extracts the most representative features through the attention block, and classifies each input into two classes: Fake and Real. Its simple layered architecture makes it a practical technique for audio spoofing detection, and the attention module's combined spatial and channel features allow it to capture complex relationships in audio signals. To evaluate the model's effectiveness, in-depth testing was conducted on the ASVspoof 2019 dataset. The proposed technique achieved an EER of 0.52% for Physical Access (PA) attacks and 0.07% for Logical Access (LA) attacks. |
---|---|
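The CBAM block mentioned in the summary applies channel attention followed by spatial attention to a convolutional feature map. A minimal NumPy sketch of that two-step refinement is shown below; the random weight matrices stand in for CBAM's learned shared MLP, and a fixed sigmoid over the pooled maps stands in for its learned 7×7 convolution, so this illustrates only the data flow, not the paper's trained model:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def channel_attention(x, reduction=2):
    # x: feature map of shape (C, H, W).
    # CBAM pools over the spatial dims (avg and max), passes both
    # descriptors through a shared 2-layer MLP, and sums the results.
    C = x.shape[0]
    avg = x.mean(axis=(1, 2))                      # (C,)
    mx = x.max(axis=(1, 2))                        # (C,)
    rng = np.random.default_rng(0)
    W1 = rng.standard_normal((C // reduction, C)) * 0.1  # toy weights
    W2 = rng.standard_normal((C, C // reduction)) * 0.1  # toy weights
    mlp = lambda v: W2 @ np.maximum(W1 @ v, 0.0)   # ReLU hidden layer
    scale = sigmoid(mlp(avg) + mlp(mx))            # per-channel gate, (C,)
    return x * scale[:, None, None]

def spatial_attention(x):
    # CBAM pools over the channel dim (avg and max) and derives a
    # per-pixel gate; a fixed combination replaces the learned conv here.
    avg = x.mean(axis=0)                           # (H, W)
    mx = x.max(axis=0)                             # (H, W)
    scale = sigmoid((avg + mx) / 2.0)              # per-pixel gate
    return x * scale[None, :, :]

def cbam(x):
    # Channel attention first, then spatial attention, as in CBAM.
    return spatial_attention(channel_attention(x))

# Toy feature map, e.g. an intermediate VGGish activation on a
# mel-spectrogram input (shapes are illustrative only).
feat = np.random.default_rng(1).standard_normal((8, 4, 4))
out = cbam(feat)
print(out.shape)  # attention rescales values but preserves the shape
```

The gates in both steps lie in (0, 1), so CBAM reweights features rather than adding new ones, which is why it can be dropped into an existing backbone such as VGGish without changing tensor shapes.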
ISSN: | 1687-4722, 1687-4714 |
DOI: | 10.1186/s13636-024-00348-4 |