Visual-Semantic Matching by Exploring High-Order Attention and Distraction

Bibliographic Details
Published in	Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online), pp. 12783-12792
Main Authors	Li, Yongzhi; Zhang, Duo; Mu, Yadong
Format	Conference Proceeding
Language	English
Published	IEEE 01.06.2020
Subjects	Computational modeling; Computer vision; Image edge detection; Proposals; Semantics; Task analysis; Visualization

Abstract	Cross-modality semantic matching is a vital task in computer vision and has attracted increasing attention in recent years. Existing methods mainly explore object-based alignment between image objects and text words. In this work, we address this task from two previously ignored aspects: high-order semantic information (e.g., object-predicate-subject triplet, object-attribute pair) and visual distraction (i.e., despite high relevance to the textual query, images may also contain many prominent distracting objects or visual relations). Specifically, we build scene graphs for both the visual and textual modalities. Our technical contributions are twofold. First, we formulate the visual-semantic matching task as an attention-driven cross-modality scene graph matching problem. Graph convolutional networks (GCNs) are used to extract high-order information from the two scene graphs, and a novel cross-graph attention mechanism is proposed to contextually reweight graph elements and calculate the inter-graph similarity. Second, we observe that some top-ranked samples are in fact false matches, caused by the co-occurrence of highly relevant and distracting information; we devise an information-theoretic measure for estimating semantic distraction and re-ranking the initial retrieval results. Comprehensive experiments and ablation studies on two large public datasets (MS-COCO and Flickr30K) demonstrate the superiority of the proposed method and the effectiveness of both high-order attention and distraction.
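
The cross-graph attention idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' formulation: the record gives no equations, so the cosine scoring, the softmax temperature, the averaging step, and all function names below are assumptions chosen for illustration. Node vectors stand in for the GCN outputs of the textual and visual scene graphs; the GCN step itself is omitted.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def softmax(scores, temperature=1.0):
    """Numerically stable softmax; a lower temperature gives sharper attention."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_graph_similarity(text_nodes, image_nodes, temperature=0.1):
    """Hypothetical inter-graph score: each text node attends over all image
    nodes, and the score is the mean cosine similarity between each text node
    and its attention-weighted image context."""
    per_node = []
    for t in text_nodes:
        weights = softmax([cosine(t, v) for v in image_nodes], temperature)
        # Attention-weighted sum of image node vectors, one dimension at a time.
        context = [
            sum(w * v[d] for w, v in zip(weights, image_nodes))
            for d in range(len(t))
        ]
        per_node.append(cosine(t, context))
    return sum(per_node) / len(per_node)

# Toy 2-D embeddings (made up): the first image graph mirrors the text graph,
# the second contains only unrelated directions.
text_graph = [[1.0, 0.0], [0.0, 1.0]]
matching_image = [[1.0, 0.0], [0.0, 1.0]]
unrelated_image = [[-1.0, 0.0], [0.0, -1.0]]

matched_score = cross_graph_similarity(text_graph, matching_image)
unrelated_score = cross_graph_similarity(text_graph, unrelated_image)
print(matched_score > unrelated_score)
```

In this toy setup the matching image graph scores higher than the unrelated one, which is the ranking behavior an inter-graph similarity is meant to provide. The distraction-based re-ranking step is not sketched because the record does not describe its measure.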
Author_xml	1. Li, Yongzhi (Center for Data Science); 2. Zhang, Duo (EECS); 3. Mu, Yadong (Wangxuan Institute of Computer Technology, Peking University)
CODEN	IEEPAD
ContentType	Conference Proceeding
DOI	10.1109/CVPR42600.2020.01280
Discipline	Applied Sciences
EISBN	9781728171685 1728171687
EISSN	1063-6919
EndPage	12792
ExternalDocumentID	9157630
Genre	orig-research
IsPeerReviewed	false
IsScholarly	true
Language	English
PageCount	10
PublicationCentury	2000
PublicationDate	2020-Jun
PublicationDateYYYYMMDD	2020-06-01
PublicationDecade	2020
PublicationTitle	Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online)
PublicationTitleAbbrev	CVPR
PublicationYear	2020
Publisher	IEEE
StartPage	12783
URI	https://ieeexplore.ieee.org/document/9157630