ViLEM: Visual-Language Error Modeling for Image-Text Retrieval
Published in | 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 11018 - 11027 |
---|---|
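Main Authors | Chen, Yuxin; Ma, Zongyang; Zhang, Ziqi; Qi, Zhongang; Yuan, Chunfeng; Shan, Ying; Li, Bing; HU, Weiming; Qie, Xiaohu; Wu, JianPing |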
Format | Conference Proceeding |
Language | English |
Published | IEEE, 01.06.2023 |
Subjects | |
Abstract | Dominant pre-training works for image-text retrieval adopt "dual-encoder" architecture to enable high efficiency, where two encoders are used to extract image and text representations and contrastive learning is employed for global alignment. However, coarse-grained global alignment ignores detailed semantic associations between image and text. In this work, we propose a novel proxy task, named Visual-Language Error Modeling (ViLEM), to inject detailed image-text association into "dual-encoder" model by "proofreading" each word in the text against the corresponding image. Specifically, we first edit the image-paired text to automatically generate diverse plausible negative texts with pre-trained language models. ViLEM then enforces the model to discriminate the correctness of each word in the plausible negative texts and further correct the wrong words via resorting to image information. Furthermore, we propose a multi-granularity interaction framework to perform ViLEM via interacting text features with both global and local image features, which associates local text semantics with both high-level visual context and multi-level local visual information. Our method surpasses state-of-the-art "dual-encoder" methods by a large margin on the image-text retrieval task and significantly improves discriminativeness to local textual semantics. Our model can also generalize well to video-text retrieval. |
---|---|
Author | HU, Weiming; Shan, Ying; Wu, JianPing; Chen, Yuxin; Qi, Zhongang; Zhang, Ziqi; Yuan, Chunfeng; Qie, Xiaohu; Ma, Zongyang; Li, Bing |
Author_xml | – sequence: 1 givenname: Yuxin surname: Chen fullname: Chen, Yuxin email: chenyuxin2019@ia.ac.cn organization: State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences – sequence: 2 givenname: Zongyang surname: Ma fullname: Ma, Zongyang email: mazongyang2020@ia.ac.cn organization: State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences – sequence: 3 givenname: Ziqi surname: Zhang fullname: Zhang, Ziqi email: ziqi.zhang@nlpr.ia.ac.cn organization: State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences – sequence: 4 givenname: Zhongang surname: Qi fullname: Qi, Zhongang email: zhongangqi@tencent.com organization: ARC Lab – sequence: 5 givenname: Chunfeng surname: Yuan fullname: Yuan, Chunfeng email: cfyuan@nlpr.ia.ac.cn organization: State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences – sequence: 6 givenname: Ying surname: Shan fullname: Shan, Ying email: yingsshan@tencent.com organization: ARC Lab – sequence: 7 givenname: Bing surname: Li fullname: Li, Bing email: bli@nlpr.ia.ac.cn organization: State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences – sequence: 8 givenname: Weiming surname: HU fullname: HU, Weiming email: wmhu@nlpr.ia.ac.cn organization: State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences – sequence: 9 givenname: Xiaohu surname: Qie fullname: Qie, Xiaohu email: tigerqie@tencent.com organization: Tencent PCG – sequence: 10 givenname: JianPing surname: Wu fullname: Wu, JianPing email: jianping@cernet.edu.cn organization: Tsinghua University |
CODEN | IEEPAD |
ContentType | Conference Proceeding |
DBID | 6IE 6IH CBEJK RIE RIO |
DOI | 10.1109/CVPR52729.2023.01060 |
DatabaseName | IEEE Electronic Library (IEL) Conference Proceedings IEEE Proceedings Order Plan (POP) 1998-present by volume IEEE Xplore All Conference Proceedings IEEE Electronic Library (IEL) IEEE Proceedings Order Plans (POP) 1998-present |
DatabaseTitleList | |
Database_xml | – sequence: 1 dbid: RIE name: IEEE Electronic Library (IEL) url: https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/ sourceTypes: Publisher |
DeliveryMethod | fulltext_linktorsrc |
EISBN | 9798350301298 |
EISSN | 2575-7075 |
EndPage | 11027 |
ExternalDocumentID | 10203985 |
Genre | orig-research |
GroupedDBID | 6IE 6IH 6IL 6IN ABLEC ADZIZ ALMA_UNASSIGNED_HOLDINGS BEFXN BFFAM BGNUA BKEBE BPEOZ CBEJK CHZPO IEGSK IJVOP OCL RIE RIL RIO |
ID | FETCH-LOGICAL-i204t-8ea97e99496bb6429b0b291dd53093ee40c7558b4c8111424d1863476f0a4c583 |
IEDL.DBID | RIE |
IngestDate | Wed Jun 26 19:26:17 EDT 2024 |
IsPeerReviewed | false |
IsScholarly | true |
Language | English |
LinkModel | DirectLink |
MergedId | FETCHMERGED-LOGICAL-i204t-8ea97e99496bb6429b0b291dd53093ee40c7558b4c8111424d1863476f0a4c583 |
PageCount | 10 |
ParticipantIDs | ieee_primary_10203985 |
PublicationCentury | 2000 |
PublicationDate | 2023-June |
PublicationDateYYYYMMDD | 2023-06-01 |
PublicationDate_xml | – month: 06 year: 2023 text: 2023-June |
PublicationDecade | 2020 |
PublicationTitle | 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) |
PublicationTitleAbbrev | CVPR |
PublicationYear | 2023 |
Publisher | IEEE |
Publisher_xml | – name: IEEE |
SSID | ssib052007398 ssib042469789 |
Score | 2.289945 |
SourceID | ieee |
SourceType | Publisher |
StartPage | 11018 |
SubjectTerms | Computational modeling; Computer architecture; Computer vision; Data models; Pattern recognition; Semantics; Vision, language, and reasoning; Visualization |
Title | ViLEM: Visual-Language Error Modeling for Image-Text Retrieval |
URI | https://ieeexplore.ieee.org/document/10203985 |
hasFullText | 1 |
inHoldings | 1 |
linkProvider | IEEE |
openUrl | ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook&rft.genre=proceeding&rft.title=2023+IEEE%2FCVF+Conference+on+Computer+Vision+and+Pattern+Recognition+%28CVPR%29&rft.atitle=ViLEM%3A+Visual-Language+Error+Modeling+for+Image-Text+Retrieval&rft.au=Chen%2C+Yuxin&rft.au=Ma%2C+Zongyang&rft.au=Zhang%2C+Ziqi&rft.au=Qi%2C+Zhongang&rft.date=2023-06-01&rft.pub=IEEE&rft.eissn=2575-7075&rft.spage=11018&rft.epage=11027&rft_id=info:doi/10.1109%2FCVPR52729.2023.01060&rft.externalDocID=10203985 |
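
The proofreading task summarized in the abstract (edit the paired text, then decide per word whether it is correct and recover the wrong words) can be illustrated with a minimal sketch of how the supervision signals might be built. This is not the authors' implementation: the abstract states that negative texts are generated with pre-trained language models, whereas this sketch swaps in random substitution from a small, hypothetical candidate vocabulary. The function name `make_negative_text` and its parameters are assumptions introduced for illustration only.

```python
import random


def make_negative_text(tokens, candidates, num_edits=1, seed=0):
    """Build a plausible-negative caption plus word-level supervision.

    Hypothetical helper: replaces `num_edits` tokens with words drawn at
    random from `candidates` (a stand-in for the language-model-based
    editing described in the abstract).

    Returns (edited, labels, targets) where
      labels[i]  is 1 if token i was altered, else 0 (error detection),
      targets[i] is the original word for altered tokens, else None
                 (error correction).
    """
    rng = random.Random(seed)
    edited = list(tokens)
    labels = [0] * len(tokens)
    targets = [None] * len(tokens)

    # Choose distinct positions to corrupt.
    positions = rng.sample(range(len(tokens)), k=min(num_edits, len(tokens)))
    for pos in positions:
        # Only accept replacements that differ from the original word.
        options = [w for w in candidates if w != tokens[pos]]
        if not options:
            continue
        edited[pos] = rng.choice(options)
        labels[pos] = 1
        targets[pos] = tokens[pos]
    return edited, labels, targets


caption = "a brown dog runs on the beach".split()
vocab = ["cat", "black", "sits", "grass", "dog"]  # illustrative only
edited, labels, targets = make_negative_text(caption, vocab, num_edits=2, seed=0)
print(edited)
print(labels)
print(targets)
```

In a training setup along the lines the abstract describes, `labels` would supervise a per-word correctness classifier and `targets` a word-prediction head, both conditioned on image features; those model components are outside the scope of this sketch.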