AD-DROP: Attribution-Driven Dropout for Robust Language Model Fine-Tuning


Bibliographic Details
Main Authors: Yang, Tao; Deng, Jinghao; Quan, Xiaojun; Wang, Qifan; Nie, Shaoliang
Format: Journal Article
Language: English
Published: 2022-10-11
Subjects: Computer Science - Computation and Language
Online Access: https://arxiv.org/abs/2210.05883

Abstract: Fine-tuning large pre-trained language models on downstream tasks is apt to suffer from overfitting when limited training data is available. While dropout proves to be an effective antidote by randomly dropping a proportion of units, existing research has not examined its effect on the self-attention mechanism. In this paper, we investigate this problem through self-attention attribution and find that dropping attention positions with low attribution scores can accelerate training and increase the risk of overfitting. Motivated by this observation, we propose Attribution-Driven Dropout (AD-DROP), which randomly discards some high-attribution positions to encourage the model to make predictions by relying more on low-attribution positions to reduce overfitting. We also develop a cross-tuning strategy to alternate fine-tuning and AD-DROP to avoid dropping high-attribution positions excessively. Extensive experiments on various benchmarks show that AD-DROP yields consistent improvements over baselines. Analysis further confirms that AD-DROP serves as a strategic regularizer to prevent overfitting during fine-tuning.
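The dropping step the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: `ad_drop_mask` and its parameter names and defaults are assumptions, and the attribution scores are taken as given (the paper derives them via self-attention attribution), so only the masking logic is shown.

```python
import random

def ad_drop_mask(attributions, candidate_ratio=0.3, drop_prob=0.5, rng=None):
    """Build a keep/drop mask over attention positions (AD-DROP-style sketch).

    Positions whose attribution score falls in the top `candidate_ratio`
    fraction become drop candidates; each candidate is independently zeroed
    with probability `drop_prob`. Low-attribution positions are always kept,
    nudging the model to rely on them. Names and defaults are illustrative.
    """
    rng = rng or random.Random(0)
    n = len(attributions)
    k = max(1, int(n * candidate_ratio))  # number of high-attribution candidates
    # Indices of the k highest-attribution positions.
    order = sorted(range(n), key=lambda i: attributions[i], reverse=True)
    candidates = set(order[:k])
    mask = []
    for i in range(n):
        if i in candidates and rng.random() < drop_prob:
            mask.append(0)  # drop a high-attribution position
        else:
            mask.append(1)  # keep the position (all low-attribution ones survive)
    return mask

# With drop_prob=1.0 every candidate is dropped, so the two highest-scoring
# positions (indices 0 and 2) are masked out:
print(ad_drop_mask([0.9, 0.1, 0.8, 0.2, 0.05],
                   candidate_ratio=0.4, drop_prob=1.0))  # → [0, 1, 0, 1, 1]
```

In a real fine-tuning loop this mask would be applied to the attention logits before softmax, and the paper's cross-tuning strategy would alternate epochs with and without it.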
Copyright: http://creativecommons.org/licenses/by-nc-sa/4.0
DOI: 10.48550/arxiv.2210.05883
Source: arXiv.org (Open Access Repository)
Open Access: Yes
Peer Reviewed: No
Open Access Link: https://arxiv.org/abs/2210.05883