Approximating How Single Head Attention Learns

Bibliographic Details
Main Authors: Snell, Charlie; Zhong, Ruiqi; Klein, Dan; Steinhardt, Jacob
Format: Journal Article (preprint)
Language: English
Published: 12.03.2021
Subjects: Computer Science - Artificial Intelligence; Computer Science - Computation and Language
Online Access: Get full text (https://arxiv.org/abs/2103.07601)
DOI: 10.48550/arxiv.2103.07601
License: http://creativecommons.org/licenses/by/4.0

Abstract: Why do models often attend to salient words, and how does this evolve throughout training? We approximate model training as a two-stage process: early in training, when the attention weights are uniform, the model learns to translate an individual input word `i` to an output word `o` if they co-occur frequently. Later, the model learns to attend to `i` while the correct output is `o` because it knows that `i` translates to `o`. To formalize this, we define a model property, Knowledge to Translate Individual Words (KTIW) (e.g., knowing that `i` translates to `o`), and claim that it drives the learning of the attention. This claim is supported by the fact that before the attention mechanism is learned, KTIW can be learned from word co-occurrence statistics, but not the other way around. In particular, we can construct a training distribution that makes KTIW hard to learn; the learning of the attention then fails, and the model cannot even learn the simple task of copying the input words to the output. Our approximation explains why models sometimes attend to salient words, and it inspires a toy example in which a multi-head attention model can overcome the above hard training distribution by improving learning dynamics rather than expressiveness. We end by discussing the limitations of our approximation framework and suggesting future directions.
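The abstract's key claim, that KTIW is learnable from word co-occurrence statistics before attention is learned and then determines where attention points, can be made concrete with a small sketch. The following is an illustration, not the paper's code: the three-sentence toy corpus, the choice of an IBM-Model-1-style EM estimator for the co-occurrence stage, and the `attend` helper are all assumptions made here for clarity.

```python
from collections import defaultdict

# Hypothetical toy parallel corpus (an assumption, not from the paper):
# each pair is (input sentence, output sentence) as word lists.
corpus = [
    ("le chat dort".split(), "the cat sleeps".split()),
    ("le chien dort".split(), "the dog sleeps".split()),
    ("le chat mange".split(), "the cat eats".split()),
]

# Stage 1: learn KTIW, i.e. t(o|i) ~ P(output word o | input word i),
# from co-occurrence statistics alone -- no attention mechanism involved.
# A few rounds of IBM-Model-1-style EM serve as the estimator here.
t = defaultdict(lambda: 1.0)  # uniform initialization (only ratios matter)
for _ in range(10):
    counts = defaultdict(float)   # expected (i, o) translation counts
    totals = defaultdict(float)   # expected counts per input word i
    for src, tgt in corpus:
        for o in tgt:
            z = sum(t[(i, o)] for i in src)
            for i in src:
                r = t[(i, o)] / z          # soft responsibility of i for o
                counts[(i, o)] += r
                totals[i] += r
    t = defaultdict(float, {(i, o): c / totals[i] for (i, o), c in counts.items()})

# Stage 2: attention driven by KTIW. While the correct output is o, attend
# to the input position whose word most probably translates to o.
def attend(src, o):
    return max(range(len(src)), key=lambda k: t[(src[k], o)])

src = "le chien mange".split()  # a source sentence unseen during stage 1
for o in ["the", "dog", "eats"]:
    k = attend(src, o)
    print(f"output {o!r} attends to position {k} ({src[k]!r})")
# expected: 'the' -> 'le', 'dog' -> 'chien', 'eats' -> 'mange'
```

In this sketch, each output word ends up attending to the input word it translates from, even on an unseen source sentence: once `t(o|i)` is pinned down by co-occurrence alone, the attention pattern follows, which is the direction of dependence the abstract argues for.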