Replay spoofing countermeasures using high spectro-temporal resolution features

Bibliographic Details
Published in International journal of speech technology Vol. 22; no. 1; pp. 271-281
Main Authors Alluri, K. N. R. K. Raju, Vuppala, Anil Kumar
Format Journal Article
Language English
Published New York Springer US 15.03.2019
Springer Nature B.V
Summary: Because replay attacks are easy for a fraudster to implement, they pose a more severe threat to automatic speaker verification (ASV) technology than other spoofing attacks such as speech synthesis and voice conversion. In a replay attack, a fraudster attempts to gain illegitimate access to an ASV system by playing back a speech sample collected from a genuine target speaker. The significant cues that differentiate genuine recordings from replayed ones are channel characteristics. To capture these characteristics, one needs to extract features from a spectrum that has high spectral and temporal resolution. Zero time windowing (ZTW) analysis of speech is one such time-frequency analysis technique; it yields a spectrum with high spectral and temporal resolution at each sampling instant. In this study, new features are proposed by applying cepstral analysis to the ZTW spectrum. Experiments are performed on two publicly available replay attack databases, BTAS 2016 and ASVspoof 2017. The first set of experiments is conducted using Gaussian mixture models to evaluate the potential of the proposed features. The proposed system achieves a half total error rate of 0.75% on the BTAS 2016 evaluation set and an equal error rate of 14.75% on the ASVspoof 2017 evaluation set. Score-level fusion is then performed by combining the proposed features with the previously proposed single frequency filtering cepstral coefficients. The fused system outperforms the previously reported best results on these two datasets.
ISSN:1381-2416
1572-8110
DOI:10.1007/s10772-019-09602-z
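
Note on the metrics named in the summary: the sketch below illustrates, under stated assumptions, how a half total error rate (HTER), an equal error rate (EER), and a score-level fusion can be computed from detection scores. It is not the authors' implementation. The function names, the convention that higher scores mean "genuine", the weighted-sum fusion rule, and the toy scores are all assumptions made for illustration; the summary does not state the fusion rule or how thresholds were chosen.

```python
def error_rates(genuine, spoof, threshold):
    """Return (FAR, FRR) at a threshold, assuming higher score = more genuine.

    FAR: fraction of spoof (replayed) trials accepted.
    FRR: fraction of genuine trials rejected.
    """
    far = sum(s >= threshold for s in spoof) / len(spoof)
    frr = sum(g < threshold for g in genuine) / len(genuine)
    return far, frr


def hter(genuine, spoof, threshold):
    """Half total error rate: mean of FAR and FRR at a fixed threshold."""
    far, frr = error_rates(genuine, spoof, threshold)
    return 0.5 * (far + frr)


def eer(genuine, spoof):
    """Approximate equal error rate.

    Scans every observed score as a candidate threshold and reports the
    mean of FAR and FRR where the two are closest to each other.
    """
    thresholds = sorted(set(genuine) | set(spoof))
    rates = [error_rates(genuine, spoof, t) for t in thresholds]
    far, frr = min(rates, key=lambda r: abs(r[0] - r[1]))
    return 0.5 * (far + frr)


def fuse_scores(scores_a, scores_b, weight=0.5):
    """Weighted-sum fusion of two systems' scores (assumed fusion rule)."""
    return [weight * a + (1.0 - weight) * b for a, b in zip(scores_a, scores_b)]


# Toy scores, invented purely for illustration.
genuine_scores = [2.0, 3.0, 4.0, 5.0]
spoof_scores = [0.0, 1.0, 2.5, 3.5]

print("HTER at threshold 3.0:", hter(genuine_scores, spoof_scores, 3.0))
print("EER:", eer(genuine_scores, spoof_scores))
```

In practice the HTER threshold is usually fixed beforehand, for example on a development set, whereas the EER is read off at the operating point where the two error rates meet.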