Approximating How Single Head Attention Learns

Bibliographic Details
Main Authors: Snell, Charlie; Zhong, Ruiqi; Klein, Dan; Steinhardt, Jacob
Format: Journal Article (preprint)
Language: English
Published: 12.03.2021
Subjects: Computer Science - Artificial Intelligence; Computer Science - Computation and Language
Online Access: Get full text (https://arxiv.org/abs/2103.07601)
DOI: 10.48550/arxiv.2103.07601
License: http://creativecommons.org/licenses/by/4.0

Abstract: Why do models often attend to salient words, and how does this evolve throughout training? We approximate model training as a two-stage process: early in training, when the attention weights are uniform, the model learns to translate an individual input word `i` to an output word `o` if they co-occur frequently. Later, the model learns to attend to `i` while the correct output is `o` because it knows that `i` translates to `o`. To formalize this, we define a model property, Knowledge to Translate Individual Words (KTIW) (e.g., knowing that `i` translates to `o`), and claim that it drives the learning of the attention. This claim is supported by the fact that before the attention mechanism is learned, KTIW can be learned from word co-occurrence statistics, but not the other way around. In particular, we can construct a training distribution that makes KTIW hard to learn; the learning of the attention then fails, and the model cannot even learn the simple task of copying the input words to the output. Our approximation explains why models sometimes attend to salient words, and it inspires a toy example in which a multi-head attention model can overcome the above hard training distribution by improving learning dynamics rather than expressiveness. We end by discussing the limitations of our approximation framework and suggesting future directions.
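The abstract's key claim, that KTIW is learnable from word co-occurrence statistics before attention is learned and then determines where attention points, can be made concrete with a small sketch. The following is an illustration, not the paper's code: the three-sentence toy corpus, the choice of an IBM-Model-1-style EM estimator for the co-occurrence stage, and the `attend` helper are all assumptions made here for clarity.

```python
from collections import defaultdict

# Hypothetical toy parallel corpus (an assumption, not from the paper):
# each pair is (input sentence, output sentence) as word lists.
corpus = [
    ("le chat dort".split(), "the cat sleeps".split()),
    ("le chien dort".split(), "the dog sleeps".split()),
    ("le chat mange".split(), "the cat eats".split()),
]

# Stage 1: learn KTIW, i.e. t(o|i) ~ P(output word o | input word i),
# from co-occurrence statistics alone -- no attention mechanism involved.
# A few rounds of IBM-Model-1-style EM serve as the estimator here.
t = defaultdict(lambda: 1.0)  # uniform initialization (only ratios matter)
for _ in range(10):
    counts = defaultdict(float)   # expected (i, o) translation counts
    totals = defaultdict(float)   # expected counts per input word i
    for src, tgt in corpus:
        for o in tgt:
            z = sum(t[(i, o)] for i in src)
            for i in src:
                r = t[(i, o)] / z          # soft responsibility of i for o
                counts[(i, o)] += r
                totals[i] += r
    t = defaultdict(float, {(i, o): c / totals[i] for (i, o), c in counts.items()})

# Stage 2: attention driven by KTIW. While the correct output is o, attend
# to the input position whose word most probably translates to o.
def attend(src, o):
    return max(range(len(src)), key=lambda k: t[(src[k], o)])

src = "le chien mange".split()  # a source sentence unseen during stage 1
for o in ["the", "dog", "eats"]:
    k = attend(src, o)
    print(f"output {o!r} attends to position {k} ({src[k]!r})")
# expected: 'the' -> 'le', 'dog' -> 'chien', 'eats' -> 'mange'
```

In this sketch, each output word ends up attending to the input word it translates from, even on an unseen source sentence: once `t(o|i)` is pinned down by co-occurrence alone, the attention pattern follows, which is the direction of dependence the abstract argues for.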