MHSAN: Multi-Head Self-Attention Network for Visual Semantic Embedding
Published in | 2020 IEEE Winter Conference on Applications of Computer Vision (WACV) pp. 1507 - 1515 |
---|---|
Main Authors | Park, Geondo; Han, Chihye; Kim, Daeshik; Yoon, Wonjun |
Format | Conference Proceeding |
Language | English |
Published | IEEE, 01.03.2020 |
Subjects | Feature extraction; Image coding; Image representation; Recurrent neural networks; Semantics; Task analysis; Visualization |
Abstract | Visual-semantic embedding enables various tasks such as image-text retrieval, image captioning, and visual question answering. The key to successful visual-semantic embedding is to express visual and textual data properly by accounting for their intricate relationship. While previous studies have made considerable advances by encoding visual and textual data into a joint space where similar concepts are closely located, they often represent data by a single vector, ignoring the presence of multiple important components in an image or text. Thus, in addition to the joint embedding space, we propose a novel multi-head self-attention network that captures various components of visual and textual data by attending to important parts in the data. Our approach achieves new state-of-the-art results in image-text retrieval tasks on the MS-COCO and Flickr30K datasets. Through visualization of the attention maps, which capture distinct semantic components at multiple positions in the image and the text, we demonstrate that our method achieves an effective and interpretable visual-semantic joint space. |
---|---|
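The multi-head self-attention pooling described in the abstract, where several attention heads each attend to different parts of a set of local features and produce one embedding per head, can be sketched as follows. This is a minimal NumPy illustration under assumed names and dimensions, not the authors' exact MHSAN architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention_pool(features, W1, W2):
    """Pool a set of local features into one embedding per attention head.

    features: (n, d)  n local features (e.g. image regions or word vectors)
    W1:       (d, h)  hidden projection (h is an assumed hidden size)
    W2:       (h, k)  k attention heads
    returns:  (k, d)  one attended embedding per head
    """
    scores = np.tanh(features @ W1) @ W2   # (n, k) unnormalized scores
    attn = softmax(scores, axis=0)         # per head, a distribution over positions
    return attn.T @ features               # (k, d) weighted sums of features

rng = np.random.default_rng(0)
feats = rng.normal(size=(36, 64))          # e.g. 36 image regions, 64-d features
W1 = rng.normal(size=(64, 128))
W2 = rng.normal(size=(128, 4))             # 4 heads capture distinct components
emb = multi_head_self_attention_pool(feats, W1, W2)
print(emb.shape)                           # prints (4, 64)
```

Each column of `attn` is a separate attention map over positions, which is what makes the multiple heads inspectable: visualizing them shows which regions or words each head focuses on.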
Author | Han, Chihye Yoon, Wonjun Kim, Daeshik Park, Geondo |
Author_xml | – sequence: 1 givenname: Geondo surname: Park fullname: Park, Geondo organization: Korea Advanced Institute of Science and Technology (KAIST) – sequence: 2 givenname: Chihye surname: Han fullname: Han, Chihye organization: Korea Advanced Institute of Science and Technology (KAIST) – sequence: 3 givenname: Daeshik surname: Kim fullname: Kim, Daeshik organization: Korea Advanced Institute of Science and Technology (KAIST) – sequence: 4 givenname: Wonjun surname: Yoon fullname: Yoon, Wonjun organization: Lunit Inc |
ContentType | Conference Proceeding |
DOI | 10.1109/WACV45572.2020.9093548 |
Discipline | Applied Sciences |
EISBN | 1728165539 9781728165530 |
EISSN | 2642-9381 |
EndPage | 1515 |
ExternalDocumentID | 9093548 |
Genre | orig-research |
IsDoiOpenAccess | false |
IsOpenAccess | true |
IsPeerReviewed | false |
IsScholarly | true |
Language | English |
OpenAccessLink | http://arxiv.org/pdf/2001.03712 |
PageCount | 9 |
PublicationCentury | 2000 |
PublicationDate | 2020-March |
PublicationDateYYYYMMDD | 2020-03-01 |
PublicationDate_xml | – month: 03 year: 2020 text: 2020-March |
PublicationDecade | 2020 |
PublicationTitle | 2020 IEEE Winter Conference on Applications of Computer Vision (WACV) |
PublicationTitleAbbrev | WACV |
PublicationYear | 2020 |
Publisher | IEEE |
SourceID | ieee |
SourceType | Publisher |
StartPage | 1507 |
SubjectTerms | Feature extraction Image coding Image representation Recurrent neural networks Semantics Task analysis Visualization |
Title | MHSAN: Multi-Head Self-Attention Network for Visual Semantic Embedding |
URI | https://ieeexplore.ieee.org/document/9093548 |