AD-DROP: Attribution-Driven Dropout for Robust Language Model Fine-Tuning
Main Authors | Yang, Tao; Deng, Jinghao; Quan, Xiaojun; Wang, Qifan; Nie, Shaoliang |
---|---|
Format | Journal Article (arXiv preprint) |
Language | English |
Published | 11.10.2022 |
Subjects | Computer Science - Computation and Language |
Online Access | https://arxiv.org/abs/2210.05883 |
Abstract | Fine-tuning large pre-trained language models on downstream tasks is apt to
suffer from overfitting when limited training data is available. While dropout
proves to be an effective antidote by randomly dropping a proportion of units,
existing research has not examined its effect on the self-attention mechanism.
In this paper, we investigate this problem through self-attention attribution
and find that dropping attention positions with low attribution scores can
accelerate training and increase the risk of overfitting. Motivated by this
observation, we propose Attribution-Driven Dropout (AD-DROP), which randomly
discards some high-attribution positions to encourage the model to make
predictions by relying more on low-attribution positions to reduce overfitting.
We also develop a cross-tuning strategy to alternate fine-tuning and AD-DROP to
avoid dropping high-attribution positions excessively. Extensive experiments on
various benchmarks show that AD-DROP yields consistent improvements over
baselines. Analysis further confirms that AD-DROP serves as a strategic
regularizer to prevent overfitting during fine-tuning. |
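The abstract outlines the core mechanism: compute an attribution score for each self-attention position, randomly mask out a subset of the highest-attribution positions, and alternate this with plain fine-tuning (the cross-tuning strategy). Below is a minimal, hypothetical PyTorch sketch of that masking step. It is not the authors' implementation: attribution is approximated here by a simple attention-times-gradient product rather than the paper's self-attention attribution, and names such as `ad_drop_mask`, `candidate_ratio`, and `drop_prob` are illustrative.

```python
# Hypothetical sketch of AD-DROP-style attention masking (not the authors' code).
# Assumes PyTorch; attribution is approximated by attention * gradient.
import torch

def ad_drop_mask(attr, candidate_ratio=0.3, drop_prob=0.5):
    """Return an additive mask (0 or -inf) that randomly discards a subset of the
    highest-attribution attention positions.

    attr            : (batch, heads, q_len, k_len) attribution score per attention position
    candidate_ratio : fraction of key positions per query row treated as high-attribution candidates
    drop_prob       : probability that a candidate position is actually dropped
    """
    k_len = attr.size(-1)
    n_candidates = max(1, int(candidate_ratio * k_len))
    # Mark the top-attribution positions in each query row as drop candidates.
    top_idx = attr.topk(n_candidates, dim=-1).indices
    candidate = torch.zeros_like(attr).scatter_(-1, top_idx, 1.0).bool()
    # Only a random subset of candidates is dropped, so high-attribution
    # positions are never removed wholesale.
    dropped = candidate & (torch.rand_like(attr) < drop_prob)
    # Additive mask applied to attention logits before the softmax.
    return torch.where(dropped, torch.full_like(attr, float("-inf")), torch.zeros_like(attr))

# Toy usage: random logits stand in for one layer's attention scores, and a crude
# attention-times-gradient product stands in for the attribution scores.
logits = torch.randn(2, 4, 8, 8, requires_grad=True)
probs = torch.softmax(logits, dim=-1)
loss = (probs * torch.randn_like(probs)).sum()     # stand-in for the task loss
loss.backward()
attribution = (probs * logits.grad).detach()
mask = ad_drop_mask(attribution)
masked_probs = torch.softmax(logits.detach() + mask, dim=-1)
```

Under the paper's cross-tuning strategy, a mask like this would be applied only in alternating epochs, so the model is not deprived of high-attribution positions throughout training.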
Copyright | http://creativecommons.org/licenses/by-nc-sa/4.0 |
DOI | 10.48550/arxiv.2210.05883 |
DatabaseName | arXiv Computer Science; arXiv.org |
SecondaryResourceType | preprint |
SourceID | arxiv |
SourceType | Open Access Repository |
SubjectTerms | Computer Science - Computation and Language |
Title | AD-DROP: Attribution-Driven Dropout for Robust Language Model Fine-Tuning |
URI | https://arxiv.org/abs/2210.05883 |