Unsupervised Vision-Language Parsing: Seamlessly Bridging Visual Scene Graphs with Language Structures via Dependency Relationships

Bibliographic Details
Published in Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online), pp. 15586-15595
Main Authors Lou, Chao; Han, Wenjuan; Lin, Yuhuan; Zheng, Zilong
Format Conference Proceeding
Language English
Published IEEE 01.06.2022
ISSN 1063-6919
DOI 10.1109/CVPR52688.2022.01516

Abstract Understanding realistic visual scene images together with language descriptions is a fundamental task towards generic visual understanding. Previous works have shown compelling and comprehensive results by building hierarchical structures for visual scenes (e.g., scene graphs) and natural languages (e.g., dependency trees) individually. However, how to construct a joint vision-language (VL) structure has barely been investigated. We introduce a new task, more challenging but worthwhile, that targets inducing such a joint VL structure in an unsupervised manner. Our goal is to bridge visual scene graphs and linguistic dependency trees seamlessly. Due to the lack of VL structural data, we start by building a new dataset, VLParse. Rather than relying on labor-intensive labeling from scratch, we propose an automatic alignment procedure that produces coarse structures, followed by human refinement to produce high-quality ones. Moreover, we benchmark our dataset by proposing a contrastive learning (CL)-based framework, VLGAE, short for Vision-Language Graph Autoencoder. Our model obtains superior performance on two derived tasks, i.e., language grammar induction and VL phrase grounding. Ablations show the effectiveness of both visual cues and dependency relationships on fine-grained VL structure construction.
Author Zheng, Zilong
Han, Wenjuan
Lin, Yuhuan
Lou, Chao
Author_xml – sequence: 1
  givenname: Chao
  surname: Lou
  fullname: Lou, Chao
  email: louchao@shanghaitech.edu.cn
  organization: Beijing Institute for General Artificial Intelligence (BIGAI), Beijing, China
– sequence: 2
  givenname: Wenjuan
  surname: Han
  fullname: Han, Wenjuan
  email: hanwenjuan@bigai.ai
  organization: Beijing Institute for General Artificial Intelligence (BIGAI), Beijing, China
– sequence: 3
  givenname: Yuhuan
  surname: Lin
  fullname: Lin, Yuhuan
  email: lin-yH20@mails.tsinghua.edu.cn
  organization: Tsinghua University, Beijing, China
– sequence: 4
  givenname: Zilong
  surname: Zheng
  fullname: Zheng, Zilong
  email: zlzheng@bigai.ai
  organization: Beijing Institute for General Artificial Intelligence (BIGAI), Beijing, China
CODEN IEEPAD
Discipline Applied Sciences
EISBN 1665469463
9781665469463
PageCount 10
SubjectTerms Benchmark testing
Buildings
Computer vision
Grounding
Linguistics
Pattern recognition
Vision + language; Explainable computer vision
Visualization
URI https://ieeexplore.ieee.org/document/9880111