AD-DROP: Attribution-Driven Dropout for Robust Language Model Fine-Tuning
Main Authors | |
---|---|
Format | Journal Article |
Language | English |
Published | 11.10.2022 |
Subjects | |
Summary: | Fine-tuning large pre-trained language models on downstream tasks is prone to overfitting when limited training data is available. While dropout proves an effective antidote by randomly dropping a proportion of units, existing research has not examined its effect on the self-attention mechanism. In this paper, we investigate this problem through self-attention attribution and find that dropping attention positions with low attribution scores can accelerate training and increase the risk of overfitting. Motivated by this observation, we propose Attribution-Driven Dropout (AD-DROP), which randomly discards some high-attribution positions to encourage the model to rely more on low-attribution positions when making predictions, thereby reducing overfitting. We also develop a cross-tuning strategy that alternates fine-tuning and AD-DROP to avoid dropping high-attribution positions excessively. Extensive experiments on various benchmarks show that AD-DROP yields consistent improvements over baselines. Further analysis confirms that AD-DROP serves as a strategic regularizer to prevent overfitting during fine-tuning. |
DOI: | 10.48550/arxiv.2210.05883 |
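As a rough illustration of the mechanism described in the summary (not the authors' implementation), the sketch below builds an additive attention mask that randomly discards a fraction of the highest-attribution positions, so the model must rely more on low-attribution ones. The function name, tensor shapes, and the `candidate_ratio` and `drop_ratio` parameters are assumptions for illustration only; computing the attribution scores themselves (via self-attention attribution) is not shown here.

```python
import torch

def ad_drop_mask(attribution: torch.Tensor,
                 candidate_ratio: float = 0.5,
                 drop_ratio: float = 0.3) -> torch.Tensor:
    """Hypothetical sketch of attribution-driven dropout.

    attribution: float tensor [batch, heads, query_len, key_len] of
        attribution scores for self-attention positions (higher = more
        influential for the current prediction).
    Returns a mask of the same shape with 0.0 for kept positions and -inf
    for dropped ones, to be added to the attention logits before softmax.
    """
    key_len = attribution.size(-1)
    n_candidates = max(1, int(key_len * candidate_ratio))

    # Mark the top-`candidate_ratio` positions (by attribution) in each row
    # as candidates for dropping.
    top_idx = attribution.topk(n_candidates, dim=-1).indices
    candidates = torch.zeros_like(attribution, dtype=torch.bool)
    candidates.scatter_(-1, top_idx, True)

    # Randomly drop each candidate with probability `drop_ratio`; all
    # low-attribution positions are always kept.
    drop = candidates & (torch.rand_like(attribution) < drop_ratio)

    mask = torch.zeros_like(attribution)
    mask.masked_fill_(drop, float("-inf"))
    return mask
```

During fine-tuning, such a mask would be added to the attention logits of the layers where the dropout is applied, and the cross-tuning strategy mentioned in the summary could be approximated by applying the mask only on alternating epochs.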