MHSAN: Multi-Head Self-Attention Network for Visual Semantic Embedding

Bibliographic Details
Published in 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1507-1515
Main Authors Park, Geondo; Han, Chihye; Kim, Daeshik; Yoon, Wonjun
Format Conference Proceeding
Language English
Published IEEE 01.03.2020
Subjects Feature extraction; Image coding; Image representation; Recurrent neural networks; Semantics; Task analysis; Visualization
Online Access Get full text

Abstract Visual-semantic embedding enables various tasks such as image-text retrieval, image captioning, and visual question answering. The key to successful visual-semantic embedding is to represent visual and textual data properly by accounting for their intricate relationship. While previous studies have made considerable progress by encoding visual and textual data into a joint space where similar concepts are closely located, they often represent each input by a single vector, ignoring the presence of multiple important components in an image or text. Thus, in addition to the joint embedding space, we propose a novel multi-head self-attention network that captures the various components of visual and textual data by attending to important parts of each input. Our approach achieves new state-of-the-art results on image-text retrieval tasks on the MS-COCO and Flickr30K datasets. Through visualization of the attention maps, which capture distinct semantic components at multiple positions in the image and the text, we demonstrate that our method yields an effective and interpretable visual-semantic joint space.
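As a concrete picture of the mechanism the abstract describes, the sketch below shows multi-head self-attention pooling: each attention head weights the sequence positions (image regions or word states) differently, so one input yields several embedding vectors rather than a single one. This is a minimal illustrative sketch following the structured self-attention formulation (softmax over positions, one column per head) that this line of work builds on; it is not the authors' released implementation, and the module name and dimensions (hidden_dim, attn_dim, num_heads) are assumptions.

```python
# Illustrative sketch of multi-head self-attention pooling, in the spirit
# of MHSAN; not the authors' code. All dimensions are assumed, not taken
# from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttentionPooling(nn.Module):
    """Pools (batch, seq_len, hidden_dim) features into num_heads
    embedding vectors, one per attention head."""
    def __init__(self, hidden_dim=1024, attn_dim=350, num_heads=8):
        super().__init__()
        self.w1 = nn.Linear(hidden_dim, attn_dim, bias=False)
        self.w2 = nn.Linear(attn_dim, num_heads, bias=False)

    def forward(self, h):
        # Attention weights over sequence positions, one column per head.
        a = F.softmax(self.w2(torch.tanh(self.w1(h))), dim=1)  # (B, L, K)
        # Each head's weights pool the features into one embedding vector.
        m = torch.bmm(a.transpose(1, 2), h)                     # (B, K, D)
        return F.normalize(m, dim=-1), a

# Usage: pool region features and word states into per-head embeddings
# (in practice each modality would have its own encoder and pooling module).
regions = torch.randn(2, 36, 1024)   # e.g. detector region features
words = torch.randn(2, 20, 1024)     # e.g. recurrent encoder hidden states
pool = MultiHeadSelfAttentionPooling()
img_emb, img_attn = pool(regions)    # img_emb: (2, 8, 1024)
txt_emb, txt_attn = pool(words)      # txt_emb: (2, 8, 1024)
```

The returned attention maps (img_attn, txt_attn) correspond to what the abstract visualizes: each head concentrates its mass on a different semantic component of the image or sentence.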
Author_xml – sequence: 1
  givenname: Geondo
  surname: Park
  fullname: Park, Geondo
  organization: Korea Advanced Institute of Science and Technology (KAIST)
– sequence: 2
  givenname: Chihye
  surname: Han
  fullname: Han, Chihye
  organization: Korea Advanced Institute of Science and Technology (KAIST)
– sequence: 3
  givenname: Daeshik
  surname: Kim
  fullname: Kim, Daeshik
  organization: Korea Advanced Institute of Science and Technology (KAIST)
– sequence: 4
  givenname: Wonjun
  surname: Yoon
  fullname: Yoon, Wonjun
  organization: Lunit Inc
ContentType Conference Proceeding
DOI 10.1109/WACV45572.2020.9093548
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE/IET Electronic Library (IEL)
IEEE Proceedings Order Plans (POP All) 1998-Present
Discipline Applied Sciences
EISBN 1728165539
9781728165530
EISSN 2642-9381
EndPage 1515
ExternalDocumentID 9093548
Genre orig-research
IsDoiOpenAccess false
IsOpenAccess true
IsPeerReviewed false
IsScholarly true
Language English
OpenAccessLink http://arxiv.org/pdf/2001.03712
PageCount 9
PublicationCentury 2000
PublicationDate 2020-March
PublicationDateYYYYMMDD 2020-03-01
PublicationDecade 2020
PublicationTitle 2020 IEEE Winter Conference on Applications of Computer Vision (WACV)
PublicationTitleAbbrev WACV
PublicationYear 2020
Publisher IEEE
SourceID ieee
SourceType Publisher
StartPage 1507
SubjectTerms Feature extraction
Image coding
Image representation
Recurrent neural networks
Semantics
Task analysis
Visualization
Title MHSAN: Multi-Head Self-Attention Network for Visual Semantic Embedding
URI https://ieeexplore.ieee.org/document/9093548