Visual-Semantic Matching by Exploring High-Order Attention and Distraction
Published in | Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online) pp. 12783 - 12792 |
---|---|
Main Authors | Li, Yongzhi; Zhang, Duo; Mu, Yadong |
Format | Conference Proceeding |
Language | English |
Published | IEEE, 01.06.2020 |
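Subjects | Computational modeling; Computer vision; Image edge detection; Proposals; Semantics; Task analysis; Visualization |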
Abstract | Cross-modality semantic matching is a vital task in computer vision and has attracted increasing attention in recent years. Existing methods mainly explore object-based alignment between image objects and text words. In this work, we address this task from two previously ignored aspects: high-order semantic information (e.g., object-predicate-subject triplets, object-attribute pairs) and visual distraction (i.e., despite high relevance to the textual query, images may also contain many prominent distracting objects or visual relations). Specifically, we build scene graphs for both the visual and textual modalities. Our technical contributions are twofold. First, we formulate the visual-semantic matching task as an attention-driven cross-modality scene graph matching problem. Graph convolutional networks (GCNs) are used to extract high-order information from the two scene graphs. A novel cross-graph attention mechanism is proposed to contextually reweight graph elements and calculate the inter-graph similarity. Second, some top-ranked samples are in fact false matches due to the co-occurrence of both highly relevant and distracting information. We devise an information-theoretic measure for estimating semantic distraction and re-ranking the initial retrieval results. Comprehensive experiments and ablation studies on two large public datasets (MS-COCO and Flickr30K) demonstrate the superiority of the proposed method and the effectiveness of both high-order attention and distraction. |
---|---|
Author | Mu, Yadong; Zhang, Duo; Li, Yongzhi |
Author_xml | 1. Li, Yongzhi (organization: Center for Data Science); 2. Zhang, Duo (organization: EECS); 3. Mu, Yadong (organization: Wangxuan Institute of Computer Technology, Peking University) |
CODEN | IEEPAD |
ContentType | Conference Proceeding |
DBID | 6IE 6IH CBEJK RIE RIO |
DOI | 10.1109/CVPR42600.2020.01280 |
DatabaseName | IEEE Electronic Library (IEL) Conference Proceedings; IEEE Proceedings Order Plan (POP) 1998-present by volume; IEEE Xplore All Conference Proceedings; IEEE Electronic Library (IEL); IEEE Proceedings Order Plans (POP) 1998-present |
Database_xml | – sequence: 1 dbid: RIE name: IEEE Electronic Library (IEL) url: https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/ sourceTypes: Publisher |
DeliveryMethod | fulltext_linktorsrc |
Discipline | Applied Sciences |
EISBN | 9781728171685; 1728171687 |
EISSN | 1063-6919 |
EndPage | 12792 |
ExternalDocumentID | 9157630 |
Genre | orig-research |
GroupedDBID | 6IE 6IH 6IL 6IN AAWTH ABLEC ADZIZ ALMA_UNASSIGNED_HOLDINGS BEFXN BFFAM BGNUA BKEBE BPEOZ CBEJK CHZPO IEGSK IJVOP OCL RIE RIL RIO |
IEDL.DBID | RIE |
IngestDate | Wed Aug 27 02:30:35 EDT 2025 |
IsPeerReviewed | false |
IsScholarly | true |
Language | English |
LinkModel | DirectLink |
PageCount | 10 |
ParticipantIDs | ieee_primary_9157630 |
PublicationCentury | 2000 |
PublicationDate | 2020-Jun |
PublicationDateYYYYMMDD | 2020-06-01 |
PublicationDate_xml | – month: 06 year: 2020 text: 2020-Jun |
PublicationDecade | 2020 |
PublicationTitle | Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online) |
PublicationTitleAbbrev | CVPR |
PublicationYear | 2020 |
Publisher | IEEE |
Publisher_xml | – name: IEEE |
SSID | ssj0003211698 |
Score | 2.2536917 |
SourceID | ieee |
SourceType | Publisher |
StartPage | 12783 |
SubjectTerms | Computational modeling Computer vision Image edge detection Proposals Semantics Task analysis Visualization |
Title | Visual-Semantic Matching by Exploring High-Order Attention and Distraction |
URI | https://ieeexplore.ieee.org/document/9157630 |
hasFullText | 1 |
inHoldings | 1 |
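
The abstract describes a cross-graph attention step that reweights graph elements and computes an inter-graph similarity. The record does not give the paper's formulation, so the following is only a minimal sketch of a generic cross-attention similarity between two sets of node embeddings, not the authors' method. The function names, the `temperature` parameter, the cosine-based scoring, and the mean pooling are all assumptions made for illustration; in the paper the node embeddings would come from GCNs over the scene graphs, which this sketch does not model.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit L2 norm (eps avoids division by zero)."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def cross_graph_similarity(visual_nodes, text_nodes, temperature=10.0):
    """Hypothetical cross-attention similarity between two node sets.

    visual_nodes: (n, d) array of visual scene-graph node embeddings.
    text_nodes:   (m, d) array of textual scene-graph node embeddings.
    Returns (score, attn), where attn has shape (m, n).
    """
    v = l2_normalize(visual_nodes)
    t = l2_normalize(text_nodes)
    cos = t @ v.T                                # (m, n) cosine similarities
    # Each text node attends over all visual nodes (assumed design choice).
    attn = softmax(temperature * cos, axis=1)
    attended = attn @ v                          # (m, d) attended visual context
    # Score each text node against its attended context, then average.
    per_node = np.sum(l2_normalize(attended) * t, axis=1)
    return float(per_node.mean()), attn


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    visual = rng.normal(size=(5, 16))   # 5 visual nodes, 16-dim embeddings
    text = rng.normal(size=(3, 16))     # 3 textual nodes
    score, attn = cross_graph_similarity(visual, text)
    print(score, attn.shape)
```

A higher `temperature` sharpens the attention toward the best-matching visual node for each text node; a lower value spreads the weight more evenly. The distraction-based re-ranking mentioned in the abstract is not sketched here because the record does not specify the measure.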