Game of Trojans: Adaptive Adversaries Against Output-based Trojaned-Model Detectors

Bibliographic Details
Published in	arXiv.org
Main Authors	Sahabandu, Dinuka; Xu, Xiaojun; Rajabi, Arezoo; Niu, Luyao; Ramasubramanian, Bhaskar; Li, Bo; Poovendran, Radha
Format	Paper
Language	English
Published	Ithaca: Cornell University Library, arXiv.org, 12.02.2024
Subjects	Detectors; Embedding; Entropy (Information theory); Evolution; Games; Greedy algorithms; Iterative methods; Mathematical models; Parameters; Sensors

Summary:	We propose and analyze an adaptive adversary that can retrain a Trojaned DNN and is aware of state-of-the-art (SOTA) output-based Trojaned-model detectors. We show that such an adversary can (1) achieve high accuracy on both trigger-embedded and clean samples and (2) bypass detection. Our approach is based on the observation that the high dimensionality of the DNN parameters provides sufficient degrees of freedom to achieve both objectives simultaneously. We also make the SOTA detectors adaptive by allowing them to be retrained to recalibrate their parameters, thus modeling a co-evolution of the parameters of the Trojaned model and of the detectors. We then show that this co-evolution can be modeled as an iterative game, and prove that the resulting (optimal) solution of this game leads to the adversary achieving both objectives. In addition, we provide a greedy algorithm for the adversary to select a minimum number of input samples for embedding triggers. We show that for the cross-entropy or log-likelihood loss functions used by the DNNs, the greedy algorithm provides provable guarantees on the number of trigger-embedded input samples needed. Extensive experiments on four diverse datasets (MNIST, CIFAR-10, CIFAR-100, and SpeechCommand) reveal that the adversary effectively evades four SOTA output-based Trojaned-model detectors: MNTD, NeuralCleanse, STRIP, and TABOR.
ISSN:	2331-8422