ViLEM: Visual-Language Error Modeling for Image-Text Retrieval

Bibliographic Details
Published in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11018-11027
Main Authors Chen, Yuxin, Ma, Zongyang, Zhang, Ziqi, Qi, Zhongang, Yuan, Chunfeng, Shan, Ying, Li, Bing, Hu, Weiming, Qie, Xiaohu, Wu, JianPing
Format Conference Proceeding
Language English
Published IEEE 01.06.2023
Abstract Dominant pre-training works for image-text retrieval adopt a "dual-encoder" architecture to enable high efficiency, where two encoders extract image and text representations and contrastive learning is employed for global alignment. However, coarse-grained global alignment ignores detailed semantic associations between image and text. In this work, we propose a novel proxy task, named Visual-Language Error Modeling (ViLEM), to inject detailed image-text association into the "dual-encoder" model by "proofreading" each word in the text against the corresponding image. Specifically, we first edit the image-paired text to automatically generate diverse, plausible negative texts with pre-trained language models. ViLEM then enforces the model to discriminate the correctness of each word in these negative texts and to correct the wrong words by resorting to image information. Furthermore, we propose a multi-granularity interaction framework that performs ViLEM by interacting text features with both global and local image features, associating local text semantics with both high-level visual context and multi-level local visual information. Our method surpasses state-of-the-art "dual-encoder" methods by a large margin on the image-text retrieval task and significantly improves discriminativeness to local textual semantics. Our model also generalizes well to video-text retrieval.
Author Hu, Weiming
Shan, Ying
Wu, JianPing
Chen, Yuxin
Qi, Zhongang
Zhang, Ziqi
Yuan, Chunfeng
Qie, Xiaohu
Ma, Zongyang
Li, Bing
Author_xml – sequence: 1
  givenname: Yuxin
  surname: Chen
  fullname: Chen, Yuxin
  email: chenyuxin2019@ia.ac.cn
  organization: State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences
– sequence: 2
  givenname: Zongyang
  surname: Ma
  fullname: Ma, Zongyang
  email: mazongyang2020@ia.ac.cn
  organization: State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences
– sequence: 3
  givenname: Ziqi
  surname: Zhang
  fullname: Zhang, Ziqi
  email: ziqi.zhang@nlpr.ia.ac.cn
  organization: State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences
– sequence: 4
  givenname: Zhongang
  surname: Qi
  fullname: Qi, Zhongang
  email: zhongangqi@tencent.com
  organization: ARC Lab
– sequence: 5
  givenname: Chunfeng
  surname: Yuan
  fullname: Yuan, Chunfeng
  email: cfyuan@nlpr.ia.ac.cn
  organization: State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences
– sequence: 6
  givenname: Ying
  surname: Shan
  fullname: Shan, Ying
  email: yingsshan@tencent.com
  organization: ARC Lab
– sequence: 7
  givenname: Bing
  surname: Li
  fullname: Li, Bing
  email: bli@nlpr.ia.ac.cn
  organization: State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences
– sequence: 8
  givenname: Weiming
  surname: Hu
  fullname: Hu, Weiming
  email: wmhu@nlpr.ia.ac.cn
  organization: State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences
– sequence: 9
  givenname: Xiaohu
  surname: Qie
  fullname: Qie, Xiaohu
  email: tigerqie@tencent.com
  organization: Tencent PCG
– sequence: 10
  givenname: JianPing
  surname: Wu
  fullname: Wu, JianPing
  email: jianping@cernet.edu.cn
  organization: Tsinghua University
CODEN IEEPAD
ContentType Conference Proceeding
DOI 10.1109/CVPR52729.2023.01060
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan (POP) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP) 1998-present
EISBN 9798350301298
EISSN 2575-7075
EndPage 11027
ExternalDocumentID 10203985
Genre orig-research
IsPeerReviewed false
IsScholarly true
PageCount 10
PublicationCentury 2000
PublicationDate 2023-June
PublicationDateYYYYMMDD 2023-06-01
PublicationDate_xml – month: 06
  year: 2023
  text: 2023-June
PublicationDecade 2020
PublicationTitle 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
PublicationTitleAbbrev CVPR
PublicationYear 2023
Publisher IEEE
Publisher_xml – name: IEEE
StartPage 11018
SubjectTerms Computational modeling
Computer architecture
Computer vision
Data models
Pattern recognition
Semantics
Vision, language, and reasoning
Visualization
Title ViLEM: Visual-Language Error Modeling for Image-Text Retrieval
URI https://ieeexplore.ieee.org/document/10203985