MHSAN: Multi-Head Self-Attention Network for Visual Semantic Embedding

Bibliographic Details
Published in 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1507-1515
Main Authors Park, Geondo; Han, Chihye; Kim, Daeshik; Yoon, Wonjun
Format Conference Proceeding
Language English
Published IEEE 01.03.2020
Subjects Feature extraction; Image coding; Image representation; Recurrent neural networks; Semantics; Task analysis; Visualization
Online Access Get full text

Abstract Visual-semantic embedding enables various tasks such as image-text retrieval, image captioning, and visual question answering. The key to successful visual-semantic embedding is to represent visual and textual data properly by accounting for their intricate relationship. While previous studies have made considerable progress by encoding visual and textual data into a joint space where similar concepts are closely located, they often represent each input by a single vector, ignoring the presence of multiple important components in an image or text. Thus, in addition to the joint embedding space, we propose a novel multi-head self-attention network that captures the various components of visual and textual data by attending to important parts of each input. Our approach achieves new state-of-the-art results on image-text retrieval tasks on the MS-COCO and Flickr30K datasets. Through visualization of the attention maps, which capture distinct semantic components at multiple positions in the image and the text, we demonstrate that our method yields an effective and interpretable visual-semantic joint space.
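As a concrete picture of the mechanism the abstract describes, the sketch below shows multi-head self-attention pooling: each attention head weights the sequence positions (image regions or word states) differently, so one input yields several embedding vectors rather than a single one. This is a minimal illustrative sketch following the structured self-attention formulation (softmax over positions, one column per head) that this line of work builds on; it is not the authors' released implementation, and the module name and dimensions (hidden_dim, attn_dim, num_heads) are assumptions.

```python
# Illustrative sketch of multi-head self-attention pooling, in the spirit
# of MHSAN; not the authors' code. All dimensions are assumed, not taken
# from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttentionPooling(nn.Module):
    """Pools (batch, seq_len, hidden_dim) features into num_heads
    embedding vectors, one per attention head."""
    def __init__(self, hidden_dim=1024, attn_dim=350, num_heads=8):
        super().__init__()
        self.w1 = nn.Linear(hidden_dim, attn_dim, bias=False)
        self.w2 = nn.Linear(attn_dim, num_heads, bias=False)

    def forward(self, h):
        # Attention weights over sequence positions, one column per head.
        a = F.softmax(self.w2(torch.tanh(self.w1(h))), dim=1)  # (B, L, K)
        # Each head's weights pool the features into one embedding vector.
        m = torch.bmm(a.transpose(1, 2), h)                     # (B, K, D)
        return F.normalize(m, dim=-1), a

# Usage: pool region features and word states into per-head embeddings
# (in practice each modality would have its own encoder and pooling module).
regions = torch.randn(2, 36, 1024)   # e.g. detector region features
words = torch.randn(2, 20, 1024)     # e.g. recurrent encoder hidden states
pool = MultiHeadSelfAttentionPooling()
img_emb, img_attn = pool(regions)    # img_emb: (2, 8, 1024)
txt_emb, txt_attn = pool(words)      # txt_emb: (2, 8, 1024)
```

The returned attention maps (img_attn, txt_attn) correspond to what the abstract visualizes: each head concentrates its mass on a different semantic component of the image or sentence.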
Author_xml – sequence: 1
  givenname: Geondo
  surname: Park
  fullname: Park, Geondo
  organization: Korea Advanced Institute of Science and Technology (KAIST)
– sequence: 2
  givenname: Chihye
  surname: Han
  fullname: Han, Chihye
  organization: Korea Advanced Institute of Science and Technology (KAIST)
– sequence: 3
  givenname: Daeshik
  surname: Kim
  fullname: Kim, Daeshik
  organization: Korea Advanced Institute of Science and Technology (KAIST)
– sequence: 4
  givenname: Wonjun
  surname: Yoon
  fullname: Yoon, Wonjun
  organization: Lunit Inc
ContentType Conference Proceeding
DOI 10.1109/WACV45572.2020.9093548
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE/IET Electronic Library (IEL)
IEEE Proceedings Order Plans (POP All) 1998-Present
Discipline Applied Sciences
EISBN 1728165539
9781728165530
EISSN 2642-9381
EndPage 1515
ExternalDocumentID 9093548
Genre orig-research
IsDoiOpenAccess false
IsOpenAccess true
IsPeerReviewed false
IsScholarly true
Language English
OpenAccessLink http://arxiv.org/pdf/2001.03712
PageCount 9
PublicationCentury 2000
PublicationDate 2020-March
PublicationDateYYYYMMDD 2020-03-01
PublicationDecade 2020
PublicationTitle 2020 IEEE Winter Conference on Applications of Computer Vision (WACV)
PublicationTitleAbbrev WACV
PublicationYear 2020
Publisher IEEE
SourceID ieee
SourceType Publisher
StartPage 1507
SubjectTerms Feature extraction
Image coding
Image representation
Recurrent neural networks
Semantics
Task analysis
Visualization
Title MHSAN: Multi-Head Self-Attention Network for Visual Semantic Embedding
URI https://ieeexplore.ieee.org/document/9093548