ViLEM: Visual-Language Error Modeling for Image-Text Retrieval

Bibliographic Details
Published in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11018-11027
Main Authors Chen, Yuxin, Ma, Zongyang, Zhang, Ziqi, Qi, Zhongang, Yuan, Chunfeng, Shan, Ying, Li, Bing, Hu, Weiming, Qie, Xiaohu, Wu, JianPing
Format Conference Proceeding
Language English
Published IEEE 01.06.2023
Abstract Dominant pre-training works for image-text retrieval adopt a "dual-encoder" architecture to enable high efficiency, where two encoders extract image and text representations and contrastive learning is employed for global alignment. However, coarse-grained global alignment ignores detailed semantic associations between image and text. In this work, we propose a novel proxy task, named Visual-Language Error Modeling (ViLEM), to inject detailed image-text association into the "dual-encoder" model by "proofreading" each word in the text against the corresponding image. Specifically, we first edit the image-paired text to automatically generate diverse, plausible negative texts with pre-trained language models. ViLEM then enforces the model to discriminate the correctness of each word in these negative texts and to correct the wrong words by resorting to image information. Furthermore, we propose a multi-granularity interaction framework that performs ViLEM by interacting text features with both global and local image features, associating local text semantics with both high-level visual context and multi-level local visual information. Our method surpasses state-of-the-art "dual-encoder" methods by a large margin on the image-text retrieval task and significantly improves discriminativeness to local textual semantics. Our model also generalizes well to video-text retrieval.
Author Hu, Weiming
Shan, Ying
Wu, JianPing
Chen, Yuxin
Qi, Zhongang
Zhang, Ziqi
Yuan, Chunfeng
Qie, Xiaohu
Ma, Zongyang
Li, Bing
Author_xml – sequence: 1
  givenname: Yuxin
  surname: Chen
  fullname: Chen, Yuxin
  email: chenyuxin2019@ia.ac.cn
  organization: State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences
– sequence: 2
  givenname: Zongyang
  surname: Ma
  fullname: Ma, Zongyang
  email: mazongyang2020@ia.ac.cn
  organization: State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences
– sequence: 3
  givenname: Ziqi
  surname: Zhang
  fullname: Zhang, Ziqi
  email: ziqi.zhang@nlpr.ia.ac.cn
  organization: State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences
– sequence: 4
  givenname: Zhongang
  surname: Qi
  fullname: Qi, Zhongang
  email: zhongangqi@tencent.com
  organization: ARC Lab
– sequence: 5
  givenname: Chunfeng
  surname: Yuan
  fullname: Yuan, Chunfeng
  email: cfyuan@nlpr.ia.ac.cn
  organization: State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences
– sequence: 6
  givenname: Ying
  surname: Shan
  fullname: Shan, Ying
  email: yingsshan@tencent.com
  organization: ARC Lab
– sequence: 7
  givenname: Bing
  surname: Li
  fullname: Li, Bing
  email: bli@nlpr.ia.ac.cn
  organization: State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences
– sequence: 8
  givenname: Weiming
  surname: Hu
  fullname: Hu, Weiming
  email: wmhu@nlpr.ia.ac.cn
  organization: State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences
– sequence: 9
  givenname: Xiaohu
  surname: Qie
  fullname: Qie, Xiaohu
  email: tigerqie@tencent.com
  organization: Tencent PCG
– sequence: 10
  givenname: JianPing
  surname: Wu
  fullname: Wu, JianPing
  email: jianping@cernet.edu.cn
  organization: Tsinghua University
CODEN IEEPAD
ContentType Conference Proceeding
DOI 10.1109/CVPR52729.2023.01060
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan (POP) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP) 1998-present
EISBN 9798350301298
EISSN 2575-7075
EndPage 11027
ExternalDocumentID 10203985
Genre orig-research
IsPeerReviewed false
IsScholarly true
PageCount 10
PublicationCentury 2000
PublicationDate 2023-June
PublicationDateYYYYMMDD 2023-06-01
PublicationDate_xml – month: 06
  year: 2023
  text: 2023-June
PublicationDecade 2020
PublicationTitle 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
PublicationTitleAbbrev CVPR
PublicationYear 2023
Publisher IEEE
Publisher_xml – name: IEEE
StartPage 11018
SubjectTerms Computational modeling
Computer architecture
Computer vision
Data models
Pattern recognition
Semantics
Vision, language, and reasoning
Visualization
Title ViLEM: Visual-Language Error Modeling for Image-Text Retrieval
URI https://ieeexplore.ieee.org/document/10203985