Break it, Imitate it, Fix it: Robustness by Generating Human-Like Attacks
Format | Journal Article |
---|---|
Language | English |
Published | 25.10.2023 |
Summary: | Published in Transactions on Machine Learning Research (2024). Real-world natural
language processing systems need to be robust to human adversaries. Collecting examples of
human adversaries for training is an effective but expensive solution. On the other hand,
training on synthetic attacks with small perturbations, such as word substitutions, does not
actually improve robustness to human adversaries. In this paper, we propose an adversarial
training framework that uses limited human adversarial examples to generate more useful
adversarial examples at scale. We demonstrate the advantages of this system on the ANLI and
hate speech detection benchmark datasets, both collected via an iterative, adversarial
human-and-model-in-the-loop procedure. Compared to training only on observed human attacks,
also training on our synthetic adversarial examples improves model robustness to future
rounds. In ANLI, we see accuracy gains on the current set of attacks (44.1% $\to$ 50.1%) and
on two future unseen rounds of human-generated attacks (32.5% $\to$ 43.4%, and 29.4% $\to$
40.2%). In hate speech detection, we see AUC gains on current attacks (0.76 $\to$ 0.84) and a
future round (0.77 $\to$ 0.79). Attacks from methods that do not learn the distribution of
existing human adversaries, meanwhile, degrade robustness. |
---|---|
DOI: | 10.48550/arxiv.2310.16955 |
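
The summary above describes the framework only at a high level, so the sketch below is an illustrative reading of it rather than the authors' released implementation. The `AttackGenerator` and `Classifier` interfaces and the `augment_with_human_like_attacks` helper are hypothetical names introduced here; they capture the loop the summary describes: fit a generative model to the observed human adversarial examples, sample synthetic human-like attacks from it at scale, and retrain the task model on the augmented data.

```python
# Illustrative sketch only (assumed interfaces, not the paper's code): learn the
# distribution of observed human attacks, sample synthetic "human-like" attacks,
# and retrain the task model on the augmented data.
from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class Example:
    text: str
    label: int


class AttackGenerator(Protocol):
    """Any generative model that can imitate the human-attack distribution."""
    def fit(self, examples: List[Example]) -> None: ...
    def sample(self, n: int) -> List[Example]: ...


class Classifier(Protocol):
    """The downstream task model, e.g. an NLI or hate-speech classifier."""
    def train(self, examples: List[Example]) -> None: ...


def augment_with_human_like_attacks(
    clean_data: List[Example],
    human_attacks: List[Example],
    generator: AttackGenerator,
    classifier: Classifier,
    n_synthetic: int = 10_000,
) -> Classifier:
    # "Imitate it": fit the generator to the limited human adversarial examples.
    generator.fit(human_attacks)
    # Generate synthetic attacks at scale from the learned distribution.
    synthetic_attacks = generator.sample(n_synthetic)
    # "Fix it": adversarial training on clean data plus human and synthetic attacks.
    classifier.train(clean_data + human_attacks + synthetic_attacks)
    return classifier
```

Per the summary, the choice of generator matters: attacks sampled from a method that does not imitate the distribution of existing human adversaries would be expected to degrade rather than improve robustness.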