MHSAN: Multi-Head Self-Attention Network for Visual Semantic Embedding
Published in | arXiv.org |
---|---|
Main Authors | Park, Geondo; Han, Chihye; Yoon, Wonjun; Kim, Daeshik |
Format | Paper |
Language | English |
Published | Ithaca: Cornell University Library, arXiv.org, 11.01.2020 |
Subjects | Embedding; Retrieval; Semantics |
Abstract | Visual-semantic embedding enables various tasks such as image-text retrieval, image captioning, and visual question answering. The key to successful visual-semantic embedding is to express visual and textual data properly by accounting for their intricate relationship. While previous studies have made substantial progress by encoding the visual and textual data into a joint space where similar concepts lie close together, they often represent each item by a single vector, ignoring the presence of multiple important components in an image or text. Thus, in addition to the joint embedding space, we propose a novel multi-head self-attention network that captures the various components of visual and textual data by attending to their important parts. Our approach achieves new state-of-the-art results in image-text retrieval tasks on the MS-COCO and Flickr30K datasets. Through visualization of the attention maps, which capture distinct semantic components at multiple positions in the image and the text, we demonstrate that our method achieves an effective and interpretable visual-semantic joint space. |
---|---|
Author | Han, Chihye; Kim, Daeshik; Park, Geondo; Yoon, Wonjun |
Author_xml | – sequence: 1 givenname: Geondo surname: Park fullname: Park, Geondo – sequence: 2 givenname: Chihye surname: Han fullname: Han, Chihye – sequence: 3 givenname: Wonjun surname: Yoon fullname: Yoon, Wonjun – sequence: 4 givenname: Daeshik surname: Kim fullname: Kim, Daeshik |
ContentType | Paper |
Copyright | 2020. This work is published under http://arxiv.org/licenses/nonexclusive-distrib/1.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License. |
Discipline | Physics |
EISSN | 2331-8422 |
Genre | Working Paper/Pre-Print |
IngestDate | Tue Sep 24 23:27:58 EDT 2024 |
IsOpenAccess | true |
IsPeerReviewed | false |
IsScholarly | false |
Language | English |
OpenAccessLink | https://www.proquest.com/docview/2337680777/abstract/?pq-origsite=%requestingapplication% |
PQID | 2337680777 |
PQPubID | 2050157 |
PublicationCentury | 2000 |
PublicationDate | 20200111 |
PublicationDateYYYYMMDD | 2020-01-11 |
PublicationDecade | 2020 |
PublicationPlace | Ithaca |
PublicationTitle | arXiv.org |
PublicationYear | 2020 |
Publisher | Cornell University Library, arXiv.org |
SecondaryResourceType | preprint |
SourceID | proquest |
SourceType | Aggregation Database |
SubjectTerms | Embedding; Retrieval; Semantics |
Title | MHSAN: Multi-Head Self-Attention Network for Visual Semantic Embedding |
URI | https://www.proquest.com/docview/2337680777/abstract/ |