Watermark Smoothing Attacks against Language Models
Main Authors | , , |
---|---|
Format | Journal Article |
Language | English |
Published | 19.07.2024 |
Subjects | |
Summary: Watermarking is a technique used to embed a hidden signal in the probability distribution of text generated by large language models (LLMs), enabling attribution of the text to the originating model. We introduce smoothing attacks and show that existing watermarking methods are not robust against minor modifications of text. An adversary can use weaker language models to smooth out the distribution perturbations caused by watermarks without significantly compromising the quality of the generated text. The modified text resulting from the smoothing attack remains close to the distribution of text that the original model (without watermark) would have produced. Our attack reveals a fundamental limitation of a wide range of watermarking techniques.
DOI: 10.48550/arxiv.2407.14206
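
The summary describes the smoothing attack only at a high level. As a rough illustration of the underlying idea, and not the authors' actual procedure, the sketch below mixes the watermarked model's next-token distribution with that of a weaker reference model before sampling. The reference model ("gpt2"), the mixing weight `alpha`, and the assumption that both models share a vocabulary are illustrative choices, not details taken from the paper.

```python
# Illustrative sketch only: the reference model ("gpt2"), the mixing weight,
# and the shared-vocabulary assumption are hypothetical, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
weak_model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def smooth_next_token(watermarked_probs: torch.Tensor,
                      prefix_ids: torch.Tensor,
                      alpha: float = 0.5) -> int:
    """Mix the watermarked model's next-token distribution with a weaker
    reference model's distribution and sample from the mixture, damping the
    per-token bias a watermark adds while staying close to ordinary text.

    watermarked_probs: probabilities over the (shared) vocabulary produced by
                       the watermarked model for the next position.
    prefix_ids:        token ids of the text generated so far, shape (1, T).
    """
    with torch.no_grad():
        weak_logits = weak_model(prefix_ids).logits[0, -1]
    weak_probs = torch.softmax(weak_logits, dim=-1)
    mixed = alpha * watermarked_probs + (1.0 - alpha) * weak_probs
    mixed = mixed / mixed.sum()  # renormalise after mixing
    return int(torch.multinomial(mixed, num_samples=1))

# Example: pick the next token from the smoothed mixture for a given prefix.
# The uniform distribution below is only a stand-in for a real watermarked
# next-token distribution.
prefix_ids = tokenizer("Watermarked text so far", return_tensors="pt").input_ids
watermarked_probs = torch.full((weak_model.config.vocab_size,),
                               1.0 / weak_model.config.vocab_size)
next_id = smooth_next_token(watermarked_probs, prefix_ids)
print(tokenizer.decode([next_id]))
```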