Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

Bibliographic Details
Published in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6077-6086
Main Authors: Anderson, Peter; He, Xiaodong; Buehler, Chris; Teney, Damien; Johnson, Mark; Gould, Stephen; Zhang, Lei
Format: Conference Proceeding
Language: English
Published: IEEE, 01.06.2018

Abstract: Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions. This is the natural basis for attention to be considered. Within our approach, the bottom-up mechanism (based on Faster R-CNN) proposes image regions, each with an associated feature vector, while the top-down mechanism determines feature weightings. Applying this approach to image captioning, our results on the MSCOCO test server establish a new state-of-the-art for the task, achieving CIDEr / SPICE / BLEU-4 scores of 117.9, 21.5 and 36.9, respectively. Demonstrating the broad applicability of the method, applying the same approach to VQA we obtain first place in the 2017 VQA Challenge.
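The mechanism the abstract describes — bottom-up region proposals with feature vectors, weighted by a top-down context signal — can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the function names, the additive-attention scoring form, and all dimensions are assumptions for illustration (the paper learns these weights inside full captioning/VQA models).

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over attention scores.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def top_down_attention(region_feats, context, W_v, W_h, w_a):
    """Weight bottom-up region features (e.g. from a Faster R-CNN
    detector) by a top-down context vector (e.g. a decoder state).

    region_feats: (N, D) array, one feature vector per proposed region.
    context:      (H,) task-specific context vector.
    Returns the attended feature (a convex combination of regions)
    and the attention weights."""
    # Score each region against the context (additive-attention sketch).
    scores = np.array([
        w_a @ np.tanh(W_v @ v + W_h @ context) for v in region_feats
    ])
    alpha = softmax(scores)            # weights over regions, sum to 1
    return alpha @ region_feats, alpha

# Toy usage: 4 proposed regions with 8-dim features, 6-dim context.
rng = np.random.default_rng(0)
N, D, H, A = 4, 8, 6, 5
feats = rng.normal(size=(N, D))
ctx = rng.normal(size=H)
W_v = rng.normal(size=(A, D))
W_h = rng.normal(size=(A, H))
w_a = rng.normal(size=A)
attended, alpha = top_down_attention(feats, ctx, W_v, W_h, w_a)
```

The key design point mirrored here is that attention is computed over a variable number of detected regions rather than over a fixed spatial grid of CNN activations.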
DOI: 10.1109/CVPR.2018.00636
EISBN: 9781538664209; 1538664208
EISSN: 1063-6919
Page Count: 10
Subjects: Context modeling; Mathematical model; Object detection; Proposals; Servers; Task analysis; Visualization
URI: https://ieeexplore.ieee.org/document/8578734