Unsupervised Vision-Language Parsing: Seamlessly Bridging Visual Scene Graphs with Language Structures via Dependency Relationships
Published in | Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online) pp. 15586 - 15595 |
---|---|
Main Authors | Lou, Chao; Han, Wenjuan; Lin, Yuhuan; Zheng, Zilong |
Format | Conference Proceeding |
Language | English |
Published | IEEE, 01.06.2022 |
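Subjects | Benchmark testing; Buildings; Computer vision; Grounding; Linguistics; Pattern recognition; Vision + language; Explainable computer vision; Visualization |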
ISSN | 1063-6919 |
DOI | 10.1109/CVPR52688.2022.01516 |
Cover
Abstract | Understanding realistic visual scene images together with language descriptions is a fundamental task towards generic visual understanding. Previous works have shown compelling comprehensive results by building hierarchical structures for visual scenes (e.g., scene graphs) and natural languages (e.g., dependency trees), individually. However, how to construct a joint vision-language (VL) structure has barely been investigated. More challenging but worthwhile, we introduce a new task that targets inducing such a joint VL structure in an unsupervised manner. Our goal is to bridge the visual scene graphs and linguistic dependency trees seamlessly. Due to the lack of VL structural data, we start by building a new dataset, VLParse. Rather than using labor-intensive labeling from scratch, we propose an automatic alignment procedure to produce coarse structures, followed by human refinement to produce high-quality ones. Moreover, we benchmark our dataset by proposing a contrastive learning (CL)-based framework, VLGAE, short for Vision-Language Graph Autoencoder. Our model obtains superior performance on two derived tasks, i.e., language grammar induction and VL phrase grounding. Ablations show the effectiveness of both visual cues and dependency relationships on fine-grained VL structure construction. |
---|---|
Author | Zheng, Zilong; Han, Wenjuan; Lin, Yuhuan; Lou, Chao |
Author_xml | – sequence: 1 givenname: Chao surname: Lou fullname: Lou, Chao email: louchao@shanghaitech.edu.cn organization: Beijing Institute for General Artificial Intelligence (BIGAI), Beijing, China – sequence: 2 givenname: Wenjuan surname: Han fullname: Han, Wenjuan email: hanwenjuan@bigai.ai organization: Beijing Institute for General Artificial Intelligence (BIGAI), Beijing, China – sequence: 3 givenname: Yuhuan surname: Lin fullname: Lin, Yuhuan email: lin-yH20@mails.tsinghua.edu.cn organization: Tsinghua University, Beijing, China – sequence: 4 givenname: Zilong surname: Zheng fullname: Zheng, Zilong email: zlzheng@bigai.ai organization: Beijing Institute for General Artificial Intelligence (BIGAI), Beijing, China |
CODEN | IEEPAD |
ContentType | Conference Proceeding |
DBID | 6IE 6IH CBEJK RIE RIO |
DOI | 10.1109/CVPR52688.2022.01516 |
DatabaseName | IEEE Electronic Library (IEL) Conference Proceedings; IEEE Proceedings Order Plan (POP) 1998-present by volume; IEEE Xplore All Conference Proceedings; IEEE Electronic Library (IEL); IEEE Proceedings Order Plans (POP) 1998-present |
DatabaseTitleList | |
Database_xml | – sequence: 1 dbid: RIE name: IEEE Electronic Library (IEL) url: https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/ sourceTypes: Publisher |
DeliveryMethod | fulltext_linktorsrc |
Discipline | Applied Sciences |
EISBN | 1665469463; 9781665469463 |
EISSN | 1063-6919 |
EndPage | 15595 |
ExternalDocumentID | 9880111 |
Genre | orig-research |
GroupedDBID | 6IE 6IH 6IL 6IN AAWTH ABLEC ADZIZ ALMA_UNASSIGNED_HOLDINGS BEFXN BFFAM BGNUA BKEBE BPEOZ CBEJK CHZPO IEGSK IJVOP OCL RIE RIL RIO |
ID | FETCH-LOGICAL-i203t-cfbe11deda8fed0d87c2c4083dd7d9a8c79e462f1869da28824049f3989ea46c3 |
IEDL.DBID | RIE |
IngestDate | Wed Aug 27 02:15:10 EDT 2025 |
IsPeerReviewed | false |
IsScholarly | true |
Language | English |
LinkModel | DirectLink |
MergedId | FETCHMERGED-LOGICAL-i203t-cfbe11deda8fed0d87c2c4083dd7d9a8c79e462f1869da28824049f3989ea46c3 |
PageCount | 10 |
ParticipantIDs | ieee_primary_9880111 |
PublicationCentury | 2000 |
PublicationDate | 2022-June |
PublicationDateYYYYMMDD | 2022-06-01 |
PublicationDate_xml | – month: 06 year: 2022 text: 2022-June |
PublicationDecade | 2020 |
PublicationTitle | Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online) |
PublicationTitleAbbrev | CVPR |
PublicationYear | 2022 |
Publisher | IEEE |
Publisher_xml | – name: IEEE |
SSID | ssj0003211698 |
Score | 2.2679465 |
SourceID | ieee |
SourceType | Publisher |
StartPage | 15586 |
SubjectTerms | Benchmark testing; Buildings; Computer vision; Grounding; Linguistics; Pattern recognition; Vision + language; Explainable computer vision; Visualization |
Title | Unsupervised Vision-Language Parsing: Seamlessly Bridging Visual Scene Graphs with Language Structures via Dependency Relationships |
URI | https://ieeexplore.ieee.org/document/9880111 |
hasFullText | 1 |
inHoldings | 1 |
linkProvider | IEEE |