VDialogUE: A Unified Evaluation Benchmark for Visually-grounded Dialogue

Bibliographic Details
Main Authors	Li, Yunshui; Hui, Binyuan; Yin, Zhaochao; He, Wanwei; Luo, Run; Long, Yuxing; Yang, Min; Huang, Fei; Li, Yongbin
Format	Journal Article
Language	English
Published	13.09.2023
Subjects	Computer Science - Computation and Language; Computer Science - Computer Vision and Pattern Recognition

Abstract	Visually-grounded dialog systems, which integrate multiple modes of communication such as text and visual inputs, have become an increasingly popular area of investigation. However, the absence of a standardized evaluation framework makes it difficult to assess progress in this field. To this end, we propose VDialogUE, a Visually-grounded Dialogue benchmark for Unified Evaluation. It defines five core multi-modal dialogue tasks and covers six datasets. Furthermore, to provide a comprehensive assessment of a model's performance across all tasks, we developed a novel evaluation metric called VDscore, which is based on the Analytic Hierarchy Process (AHP) method. Additionally, we present a straightforward yet efficient baseline model, named VISIT (VISually-grounded dIalog Transformer), to promote the advancement of general multi-modal dialogue systems. It progressively builds its multi-modal foundation and dialogue capability via a two-stage pre-training strategy. We believe that the VDialogUE benchmark, along with the evaluation scripts and our baseline models, will accelerate the development of visually-grounded dialog systems and lead to more sophisticated and effective pre-trained models.
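
The abstract states only that VDscore is based on the Analytic Hierarchy Process; this record does not give the actual formula, task weights, or per-task metrics. As a rough illustration of how an AHP-style aggregate score can be formed in general, the sketch below derives priority weights from a pairwise comparison matrix using the row geometric-mean approximation and then takes a weighted sum of per-task scores. The function names, the 3x3 comparison matrix, and the task scores are all hypothetical and are not taken from the paper.

```python
from math import prod  # requires Python 3.8+


def ahp_weights(pairwise):
    """Approximate AHP priority weights with the row geometric-mean method.

    pairwise[i][j] says how much more important criterion i is than
    criterion j; a consistent matrix has pairwise[j][i] == 1 / pairwise[i][j].
    """
    n = len(pairwise)
    geo_means = [prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(geo_means)
    # Normalise so the weights sum to 1.
    return [g / total for g in geo_means]


def weighted_score(task_scores, weights):
    """Combine per-task scores into a single number using the weights."""
    return sum(w * s for w, s in zip(weights, task_scores))


# Hypothetical example with three tasks (the benchmark itself defines five).
pairwise = [
    [1.0, 2.0, 4.0],   # task A vs. A, B, C
    [0.5, 1.0, 2.0],   # task B vs. A, B, C
    [0.25, 0.5, 1.0],  # task C vs. A, B, C
]
weights = ahp_weights(pairwise)

# Hypothetical per-task scores, already normalised to the range [0, 1].
task_scores = [0.6, 0.5, 0.8]
overall = weighted_score(task_scores, weights)

print("weights:", [round(w, 4) for w in weights])
print("overall:", round(overall, 4))
```

The geometric-mean method is a common stand-in for the principal-eigenvector computation used in classical AHP; for a perfectly consistent matrix such as the one above, the two give the same weights. Whether VDscore uses either variant is not stated in this record.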
BackLink	https://doi.org/10.48550/arXiv.2309.07387 (View paper in arXiv)
ContentType	Journal Article
Copyright	http://arxiv.org/licenses/nonexclusive-distrib/1.0
DOI	10.48550/arxiv.2309.07387
DatabaseName	arXiv Computer Science arXiv.org
IsDoiOpenAccess	true
IsOpenAccess	true
IsPeerReviewed	false
IsScholarly	false
Language	English
LinkModel	DirectLink
OpenAccessLink	https://arxiv.org/abs/2309.07387
PublicationDate	2023-09-13
SecondaryResourceType	preprint
SourceID	arxiv
SourceType	Open Access Repository
URI	https://arxiv.org/abs/2309.07387