Interpretable Spectrum Transformation Attacks to Speaker Recognition Systems

Bibliographic Details
Published in	IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 32, pp. 1531-1545
Main Authors	Yao, Jiadi; Luo, Hong; Qi, Jun; Zhang, Xiao-Lei
Format	Journal Article
Language	English
Published	Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2024
Subjects	adversarial examples; adversarial transferability; Black boxes; black-box attacks; Closed box; Data models; Discrete cosine transform; Frequencies; Frequency domain analysis; Glass box; Optimization; Perturbation methods; Speaker recognition; Speech recognition; Time-frequency analysis
Summary:	Adversarial attacks on speaker recognition have succeeded mainly in white-box scenarios. When adversarial voices generated by attacking white-box surrogate models are applied to black-box victim models, i.e., in transfer-based black-box attacks, their transferability is not only far from satisfactory but also lacks an interpretable basis. To address these issues, in this article we propose a general framework, named spectral transformation attack based on modified discrete cosine transform (STA-MDCT), to improve the transferability of adversarial voices to a black-box victim model. Specifically, we first apply the MDCT to the input voice. Then, we slightly modify the energy of different frequency bands to capture the salient regions of the adversarial noise in the time-frequency domain that are critical to a successful attack. Unlike existing approaches, which operate on voices in the time domain, the proposed framework operates on voices in the time-frequency domain, which improves the interpretability, transferability, and imperceptibility of the attack. Moreover, it can be implemented with any gradient-based attacker. To exploit the advantage of model ensembling, we implement STA-MDCT not only with a single white-box surrogate model but also with an ensemble of surrogate models. Finally, we visualize the saliency maps of adversarial voices using class activation maps (CAM), which offer an interpretable basis for transfer-based attacks in speaker recognition for the first time. Extensive comparisons with six representative attackers show that the CAM visualization clearly explains the effectiveness of STA-MDCT and the weaknesses of the comparison methods, and that the proposed method outperforms the comparison methods by a large margin. Our audio samples are available on the demo website.
ISSN:	2329-9290, 2329-9304
DOI:	10.1109/TASLP.2024.3364100
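
The two steps named in the summary (apply the MDCT, then slightly modify the energy of different frequency bands) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the sine window, the frame size `N`, the equal-width band split `n_bands`, and the random per-band gain range `rho` are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import numpy as np


def _basis(N):
    # MDCT cosine basis, shape (N, 2N): N coefficients per 2N-sample frame.
    n = np.arange(2 * N)
    k = np.arange(N)
    return np.cos(np.pi / N * np.outer(k + 0.5, n + 0.5 + N / 2))


def _window(N):
    # Sine window; satisfies w[n]^2 + w[n+N]^2 = 1 (Princen-Bradley).
    return np.sin(np.pi * (np.arange(2 * N) + 0.5) / (2 * N))


def mdct(x, N):
    # Zero-pad by N on each side so every signal sample is covered by
    # two overlapping frames, then take frames of length 2N with hop N.
    pad = (-len(x)) % N
    xp = np.concatenate([np.zeros(N), x, np.zeros(pad + N)])
    n_frames = len(xp) // N - 1
    frames = np.stack([xp[t * N:(t + 2) * N] for t in range(n_frames)])
    return (frames * _window(N)) @ _basis(N).T  # shape (n_frames, N)


def imdct(X, N, length):
    # Windowed inverse with 2/N scaling; overlap-add cancels the aliasing.
    frames = (X @ _basis(N)) * _window(N) * (2.0 / N)
    out = np.zeros((X.shape[0] + 1) * N)
    for t, frame in enumerate(frames):
        out[t * N:(t + 2) * N] += frame
    return out[N:N + length]


def spectrum_transform(x, N=256, n_bands=8, rho=0.1, rng=None):
    # Assumed transformation: scale each of n_bands equal-width frequency
    # bands by a random gain drawn from [1 - rho, 1 + rho].
    rng = np.random.default_rng() if rng is None else rng
    X = mdct(x, N)
    edges = np.linspace(0, N, n_bands + 1).astype(int)
    gains = rng.uniform(1.0 - rho, 1.0 + rho, size=n_bands)
    for b in range(n_bands):
        X[:, edges[b]:edges[b + 1]] *= gains[b]
    return imdct(X, N, len(x))
```

In a transfer-based attack, a gradient-based attacker would presumably compute its gradients on such transformed copies of the input rather than on the raw waveform; that step needs a surrogate model and is left out of this sketch.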