Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
Published in | 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition pp. 6077 - 6086 |
---|---|
Main Authors | Anderson, Peter; He, Xiaodong; Buehler, Chris; Teney, Damien; Johnson, Mark; Gould, Stephen; Zhang, Lei |
Format | Conference Proceeding |
Language | English |
Published | IEEE, 01.06.2018 |
Subjects | Context modeling; Mathematical model; Object detection; Proposals; Servers; Task analysis; Visualization |
Online Access | https://ieeexplore.ieee.org/document/8578734 |
Abstract | Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions. This is the natural basis for attention to be considered. Within our approach, the bottom-up mechanism (based on Faster R-CNN) proposes image regions, each with an associated feature vector, while the top-down mechanism determines feature weightings. Applying this approach to image captioning, our results on the MSCOCO test server establish a new state-of-the-art for the task, achieving CIDEr / SPICE / BLEU-4 scores of 117.9, 21.5 and 36.9, respectively. Demonstrating the broad applicability of the method, applying the same approach to VQA we obtain first place in the 2017 VQA Challenge. |
---|---|
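The abstract describes a two-stage design: a bottom-up detector proposes k image regions, each with a feature vector, and a top-down mechanism scores those regions against a task context (e.g., a caption LSTM's hidden state) to produce a weighted feature. A minimal NumPy sketch of that attention step is below; the `tanh` scoring form follows the paper's general recipe, but the function name `attend`, the weight names, and the toy dimensions are illustrative, not the paper's actual trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(V, h, W_v, W_h, w):
    """Weight bottom-up region features V by a top-down context h.

    V   : (k, d)  one feature vector per detected region (bottom-up)
    h   : (q,)    top-down context, e.g. an LSTM hidden state
    W_v : (d, m), W_h : (q, m), w : (m,)  illustrative learned parameters
    Returns the attended feature (a convex combination of rows of V)
    and the attention weights.
    """
    scores = np.tanh(V @ W_v + h @ W_h) @ w   # (k,): one score per region
    alpha = softmax(scores)                   # weights sum to 1
    return alpha @ V, alpha                   # (d,), (k,)

# toy sizes: k regions, d-dim features, q-dim context, m hidden units
k, d, q, m = 5, 4, 3, 8
V = rng.standard_normal((k, d))
h = rng.standard_normal(q)
W_v = rng.standard_normal((d, m))
W_h = rng.standard_normal((q, m))
w = rng.standard_normal(m)

v_hat, alpha = attend(V, h, W_v, W_h, w)
```

In the paper's framing, the key change from prior work is what the rows of `V` are: salient object regions from Faster R-CNN rather than a uniform grid of CNN features, while the top-down scoring itself stays conventional.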
Author | Johnson, Mark He, Xiaodong Anderson, Peter Zhang, Lei Buehler, Chris Teney, Damien Gould, Stephen |
CODEN | IEEPAD |
ContentType | Conference Proceeding |
DOI | 10.1109/CVPR.2018.00636 |
DatabaseName | IEEE Electronic Library (IEL) Conference Proceedings IEEE Proceedings Order Plan (POP) 1998-present by volume IEEE Xplore All Conference Proceedings IEEE Electronic Library (IEL) IEEE Proceedings Order Plans (POP) 1998-present |
Discipline | Applied Sciences |
EISBN | 9781538664209 1538664208 |
EISSN | 1063-6919 |
EndPage | 6086 |
ExternalDocumentID | 8578734 |
Genre | orig-research |
IsPeerReviewed | false |
IsScholarly | true |
Language | English |
PageCount | 10 |
PublicationCentury | 2000 |
PublicationDate | 2018-06 |
PublicationDateYYYYMMDD | 2018-06-01 |
PublicationDecade | 2010 |
PublicationTitle | 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition |
PublicationTitleAbbrev | CVPR |
PublicationYear | 2018 |
Publisher | IEEE |
SourceID | ieee |
SourceType | Publisher |
StartPage | 6077 |
SubjectTerms | Context modeling Mathematical model Object detection Proposals Servers Task analysis Visualization |
Title | Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering |
URI | https://ieeexplore.ieee.org/document/8578734 |