Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

Bibliographic Details
Published in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6077-6086
Main Authors: Anderson, Peter; He, Xiaodong; Buehler, Chris; Teney, Damien; Johnson, Mark; Gould, Stephen; Zhang, Lei
Format: Conference Proceeding
Language: English
Published: IEEE, 01.06.2018

Abstract: Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions. This is the natural basis for attention to be considered. Within our approach, the bottom-up mechanism (based on Faster R-CNN) proposes image regions, each with an associated feature vector, while the top-down mechanism determines feature weightings. Applying this approach to image captioning, our results on the MSCOCO test server establish a new state-of-the-art for the task, achieving CIDEr / SPICE / BLEU-4 scores of 117.9, 21.5 and 36.9, respectively. Demonstrating the broad applicability of the method, applying the same approach to VQA we obtain first place in the 2017 VQA Challenge.
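The mechanism the abstract describes — bottom-up region proposals with feature vectors, weighted by a top-down context signal — can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the function names, the additive-attention scoring form, and all dimensions are assumptions for illustration (the paper learns these weights inside full captioning/VQA models).

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over attention scores.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def top_down_attention(region_feats, context, W_v, W_h, w_a):
    """Weight bottom-up region features (e.g. from a Faster R-CNN
    detector) by a top-down context vector (e.g. a decoder state).

    region_feats: (N, D) array, one feature vector per proposed region.
    context:      (H,) task-specific context vector.
    Returns the attended feature (a convex combination of regions)
    and the attention weights."""
    # Score each region against the context (additive-attention sketch).
    scores = np.array([
        w_a @ np.tanh(W_v @ v + W_h @ context) for v in region_feats
    ])
    alpha = softmax(scores)            # weights over regions, sum to 1
    return alpha @ region_feats, alpha

# Toy usage: 4 proposed regions with 8-dim features, 6-dim context.
rng = np.random.default_rng(0)
N, D, H, A = 4, 8, 6, 5
feats = rng.normal(size=(N, D))
ctx = rng.normal(size=H)
W_v = rng.normal(size=(A, D))
W_h = rng.normal(size=(A, H))
w_a = rng.normal(size=A)
attended, alpha = top_down_attention(feats, ctx, W_v, W_h, w_a)
```

The key design point mirrored here is that attention is computed over a variable number of detected regions rather than over a fixed spatial grid of CNN activations.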
DOI: 10.1109/CVPR.2018.00636
EISBN: 9781538664209; 1538664208
EISSN: 1063-6919
Page Count: 10
Subjects: Context modeling; Mathematical model; Object detection; Proposals; Servers; Task analysis; Visualization
URI: https://ieeexplore.ieee.org/document/8578734