Visual-Semantic Matching by Exploring High-Order Attention and Distraction

Bibliographic Details
Published in Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online) pp. 12783 - 12792
Main Authors Li, Yongzhi, Zhang, Duo, Mu, Yadong
Format Conference Proceeding
Language English
Published IEEE 01.06.2020
Abstract Cross-modality semantic matching is a vital task in computer vision and has attracted increasing attention in recent years. Existing methods mainly explore object-based alignment between image objects and text words. In this work, we address this task from two previously ignored aspects: high-order semantic information (e.g., object-predicate-subject triplets and object-attribute pairs) and visual distraction (i.e., despite high relevance to the textual query, images may also contain many prominent distracting objects or visual relations). Specifically, we build scene graphs for both the visual and textual modalities. Our technical contributions are twofold. First, we formulate the visual-semantic matching task as an attention-driven cross-modality scene graph matching problem. Graph convolutional networks (GCNs) are used to extract high-order information from the two scene graphs, and a novel cross-graph attention mechanism is proposed to contextually reweight graph elements and calculate the inter-graph similarity. Second, because some top-ranked samples are in fact false matches caused by the co-occurrence of highly relevant and distracting information, we devise an information-theoretic measure for estimating semantic distraction and re-ranking the initial retrieval results. Comprehensive experiments and ablation studies on two large public datasets (MS-COCO and Flickr30K) demonstrate the superiority of the proposed method and the effectiveness of both high-order attention and distraction.
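The abstract does not give the form of the cross-graph attention, so the following is only a minimal illustrative sketch of the general idea of attention-weighted similarity between two sets of graph node embeddings. It assumes the node embeddings have already been produced (for example by GCNs) and uses a generic softmax-over-cosine attention; the function names, the `temperature` parameter, and the averaging step are all hypothetical choices and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature (hypothetical parameter)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_graph_similarity(text_nodes, image_nodes, temperature=0.1):
    """Generic attention-based similarity between two sets of node embeddings.

    Assumption: each text node attends over all image nodes, the attended
    image vector is compared with the text node by cosine similarity, and
    the per-node scores are averaged. This is a stand-in, not the paper's
    formulation.
    """
    dim = len(text_nodes[0])
    scores = []
    for t in text_nodes:
        # Attention weights of this text node over the image nodes.
        weights = softmax([cosine(t, v) for v in image_nodes], temperature)
        # Attention-weighted sum of image node embeddings.
        attended = [
            sum(w * v[i] for w, v in zip(weights, image_nodes))
            for i in range(dim)
        ]
        scores.append(cosine(t, attended))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    text = [[1.0, 0.0], [0.0, 1.0]]
    matching_image = [[1.0, 0.0], [0.0, 1.0]]
    mismatched_image = [[-1.0, 0.0], [0.0, -1.0]]
    print(cross_graph_similarity(text, matching_image))
    print(cross_graph_similarity(text, mismatched_image))
```

In this toy setup the matching pair scores higher than the mismatched pair. A distraction-aware re-ranking step, as described in the abstract, would operate on top of such initial scores; its information-theoretic measure is not specified here and is therefore not sketched.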
Author_xml – sequence: 1
  givenname: Yongzhi
  surname: Li
  fullname: Li, Yongzhi
  organization: Center for Data Science
– sequence: 2
  givenname: Duo
  surname: Zhang
  fullname: Zhang, Duo
  organization: EECS
– sequence: 3
  givenname: Yadong
  surname: Mu
  fullname: Mu, Yadong
  organization: Wangxuan Institute of Computer Technology, Peking University
CODEN IEEPAD
ContentType Conference Proceeding
DOI 10.1109/CVPR42600.2020.01280
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan (POP) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP) 1998-present
Discipline Applied Sciences
EISBN 9781728171685
1728171687
EISSN 1063-6919
EndPage 12792
ExternalDocumentID 9157630
Genre orig-research
IsPeerReviewed false
IsScholarly true
PageCount 10
ParticipantIDs ieee_primary_9157630
PublicationCentury 2000
PublicationDate 2020-Jun
PublicationDateYYYYMMDD 2020-06-01
PublicationDate_xml – month: 06
  year: 2020
  text: 2020-Jun
PublicationDecade 2020
PublicationTitle Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online)
PublicationTitleAbbrev CVPR
PublicationYear 2020
Publisher IEEE
Publisher_xml – name: IEEE
StartPage 12783
SubjectTerms Computational modeling
Computer vision
Image edge detection
Proposals
Semantics
Task analysis
Visualization
Title Visual-Semantic Matching by Exploring High-Order Attention and Distraction
URI https://ieeexplore.ieee.org/document/9157630
hasFullText 1
inHoldings 1
isFullTextHit
isPrint
link http://utb.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMwjV09T8MwELVKJ6YCLeJbHhhx6tiJG4-oUKFKhQpo1a3yp1QBKaLJAL8e2wlBIAa2yEuss3J37_LeMwDnNGNZpgYpUpJblLgiglzfnCGd4cxoKrXAXo08uWU3s2S8SBctcNFoYYwxgXxmIv8Y_uXrtSr9qKzPY9cdUwfQtxxwq7RazTyFOiTDeFar42LM-8P59D74rzsUSHDkMzH-cYdKKCGjDph8vbxijjxFZSEj9fHLl_G_u9sBvW-xHpw2ZWgXtEy-Bzp1dwnrb3fTBeP5alOKZ_RgXlw0VwpOXBb28yco32FDxYOe-IHuvCEnvCyKigwJRa7hVbDYDTKIHpiNrh-HN6i-SQGtCKYFIpbFGieDNKVM6JQqwgTWQjGjY00Ulloz76uvrLKJlXFqMJZWOuiWDKzmgu6Ddr7OzQGANrVYJsE6jCaSCs4JloQPhOApMYwcgq4PzfK1MstY1lE5-nv5GGz7w6m4VyegXbyV5tRV-UKeheP9BEgLp1w
linkProvider IEEE
linkToHtml http://utb.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMwjV1NTwIxEG0IHvTkBxi_7cGjhW67LdujQQkiIFEg3Eg_E6KCkeWgv952d8VoPHjb9LLNNDtvZva9VwAuaMKTRDcY0ko4FHsQQb5uTpBJcGINVUbioEbu9Xl7FHcmbFICl2stjLU2I5_ZWnjM_uWbhV6FUVldRL46pr5B3_C4z6JcrbWeqFDfy3CRFPq4CIt6czx4yBzYfR9IcC3kYvzjFpUMRFrboPf1-pw78lRbpaqmP345M_53fzug-i3Xg4M1EO2Ckp3vge2ivoTF17usgM54tlzJZ_RoX3w8Zxr2fB4OEyio3uGajAcD9QPdB0tOeJWmOR0SyrmB15nJbiaEqIJR62bYbKPiLgU0I5imiDgeGRw3GKNcGkY14RIbqbk1kSEaK2N4cNbXTrvYqYhZjJVTvnmLG84ISfdBeb6Y2wMAHXNYxZl5GI0VlUIQrIhoSCkYsZwcgkoIzfQ1t8uYFlE5-nv5HGy2h73utHvbvzsGW-GgcibWCSinbyt76jE_VWfZUX8CM-WqpQ
openUrl ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook&rft.genre=proceeding&rft.title=Proceedings+%28IEEE+Computer+Society+Conference+on+Computer+Vision+and+Pattern+Recognition.+Online%29&rft.atitle=Visual-Semantic+Matching+by+Exploring+High-Order+Attention+and+Distraction&rft.au=Li%2C+Yongzhi&rft.au=Zhang%2C+Duo&rft.au=Mu%2C+Yadong&rft.date=2020-06-01&rft.pub=IEEE&rft.eissn=1063-6919&rft.spage=12783&rft.epage=12792&rft_id=info:doi/10.1109%2FCVPR42600.2020.01280&rft.externalDocID=9157630