The Role of Long-Term Dependency in Synthetic Speech Detection
Published in: IEEE Signal Processing Letters, Vol. 29, pp. 1142-1146
Main Authors: , ,
Format: Journal Article
Language: English
Published: New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2022
Summary: Although much progress has been made in synthetic speech detection, a comprehensive analysis of the essential differences between spoofed and genuine speech is still lacking. Here we use the supervised contrastive loss, which originates from contrastive learning, as an analytical tool to characterize the class similarity structure of the ASVspoof 2019 logical access (LA) dataset. The analysis shows that an ideal back-end classifier for synthetic speech detection should be able to capture long-term dependencies. The Transformer has recently been found to excel at learning long-term dependencies in input data, so we propose a back-end classifier for synthetic speech detection based on the Transformer encoder. Convolution blocks are added before the Transformer encoder, leveraging inductive biases to improve generalization. Compared with two-dimensional convolution, one-dimensional convolution makes better architectural assumptions about the input speech features, which helps with modeling long-term dependencies and reduces the risk of overfitting. The proposed Transformer combined with one-dimensional convolution has fewer parameters than most existing back-end classifiers, and achieves an equal error rate of 1.06% and a minimum tandem detection cost function of 0.0345 on the ASVspoof 2019 LA dataset, making it one of the best models reported in the literature.
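The supervised contrastive (SupCon) loss the summary mentions as an analytical tool can be sketched in plain Python. This is a minimal illustration of the standard SupCon formulation, not the paper's implementation; the function name, list-based inputs, and temperature of 0.07 are illustrative assumptions. For each anchor, the loss averages the negative log-probability of its same-class positives against all other samples, so it is small when classes form tight clusters.

```python
import math

def supcon_loss(embeddings, labels, temperature=0.07):
    """Supervised contrastive loss over L2-normalized embeddings.

    Pure-Python sketch for illustration; names and defaults are
    assumptions, not taken from the paper's code.
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    z = [normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # anchors with no positive pair contribute nothing
        # Denominator sums similarities to every other sample.
        denom = sum(math.exp(dot(z[i], z[a]) / temperature)
                    for a in range(n) if a != i)
        # Average log-probability of each positive for this anchor.
        inner = sum(math.log(math.exp(dot(z[i], z[p]) / temperature) / denom)
                    for p in positives)
        total += -inner / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0

# Tight same-class clusters give a near-zero loss; mixed-up clusters a large one.
tight = supcon_loss([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]], [0, 0, 1, 1])
mixed = supcon_loss([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]], [0, 0, 1, 1])
assert tight < mixed
```

Used this way on pooled utterance embeddings, a low loss for one class structure versus another indicates which representations separate spoofed from genuine speech, which is how such a loss can serve as an analytical probe rather than only a training objective.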
ISSN: 1070-9908, 1558-2361
DOI: 10.1109/LSP.2022.3169954