Unsupervised Vision-Language Parsing: Seamlessly Bridging Visual Scene Graphs with Language Structures via Dependency Relationships

Bibliographic Details
Published in	Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online), pp. 15586-15595
Main Authors	Lou, Chao; Han, Wenjuan; Lin, Yuhuan; Zheng, Zilong
Format	Conference Proceeding
Language	English
Published	IEEE 01.06.2022
Subjects	Benchmark testing; Buildings; Computer vision; Grounding; Linguistics; Pattern recognition; Vision + language; Explainable computer vision; Visualization
ISSN	1063-6919
DOI	10.1109/CVPR52688.2022.01516

Abstract	Understanding realistic visual scene images together with language descriptions is a fundamental task towards generic visual understanding. Previous works have shown compelling comprehensive results by building hierarchical structures for visual scenes (e.g., scene graphs) and natural languages (e.g., dependency trees), individually. However, how to construct a joint vision-language (VL) structure has barely been investigated. As a more challenging but worthwhile step, we introduce a new task that targets inducing such a joint VL structure in an unsupervised manner. Our goal is to bridge the visual scene graphs and linguistic dependency trees seamlessly. Due to the lack of VL structural data, we start by building a new dataset, VLParse. Rather than using labor-intensive labeling from scratch, we propose an automatic alignment procedure to produce coarse structures, followed by human refinement to produce high-quality ones. Moreover, we benchmark our dataset by proposing a contrastive learning (CL)-based framework, VLGAE, short for Vision-Language Graph Autoencoder. Our model obtains superior performance on two derived tasks, i.e., language grammar induction and VL phrase grounding. Ablations show the effectiveness of both visual cues and dependency relationships on fine-grained VL structure construction.
Author_xml	Lou, Chao (louchao@shanghaitech.edu.cn), Beijing Institute for General Artificial Intelligence (BIGAI), Beijing, China
	Han, Wenjuan (hanwenjuan@bigai.ai), Beijing Institute for General Artificial Intelligence (BIGAI), Beijing, China
	Lin, Yuhuan (lin-yH20@mails.tsinghua.edu.cn), Tsinghua University, Beijing, China
	Zheng, Zilong (zlzheng@bigai.ai), Beijing Institute for General Artificial Intelligence (BIGAI), Beijing, China
CODEN	IEEPAD
ContentType	Conference Proceeding
Discipline	Applied Sciences
EISBN	1665469463 9781665469463
EISSN	1063-6919
EndPage	15595
ExternalDocumentID	9880111
Genre	orig-research
IsPeerReviewed	false
IsScholarly	true
Language	English
LinkModel	DirectLink
PageCount	10
PublicationCentury	2000
PublicationDate	2022-June
PublicationDateYYYYMMDD	2022-06-01
PublicationDate_xml	– month: 06 year: 2022 text: 2022-June
PublicationDecade	2020
PublicationTitle	Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online)
PublicationTitleAbbrev	CVPR
PublicationYear	2022
Publisher	IEEE
Publisher_xml	– name: IEEE
StartPage	15586
URI	https://ieeexplore.ieee.org/document/9880111