VDialogUE: A Unified Evaluation Benchmark for Visually-grounded Dialogue
Main Authors | Li, Yunshui; Hui, Binyuan; Yin, Zhaochao; He, Wanwei; Luo, Run; Long, Yuxing; Yang, Min; Huang, Fei; Li, Yongbin |
---|---|
Format | Journal Article (preprint) |
Language | English |
Published | 13.09.2023 |
Subjects | Computer Science - Computation and Language; Computer Science - Computer Vision and Pattern Recognition |
Online Access | https://arxiv.org/abs/2309.07387 |
Abstract | Visually-grounded dialogue systems, which integrate multiple modes of communication such as text and visual inputs, have become an increasingly popular area of investigation. However, the absence of a standardized evaluation framework poses a challenge in assessing the development of this field. To this end, we propose VDialogUE, a Visually-grounded Dialogue benchmark for Unified Evaluation. It defines five core multi-modal dialogue tasks and covers six datasets. Furthermore, to provide a comprehensive assessment of model performance across all tasks, we develop a novel evaluation metric called VDscore, based on the Analytic Hierarchy Process (AHP) method. Additionally, we present a straightforward yet efficient baseline model, named VISIT (VISually-grounded dIalog Transformer), to promote the advancement of general multi-modal dialogue systems. It progressively builds its multi-modal foundation and dialogue capability via a two-stage pre-training strategy. We believe that the VDialogUE benchmark, along with the evaluation scripts and our baseline models, will accelerate the development of visually-grounded dialogue systems and lead to more sophisticated and effective pre-trained models. |
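The abstract states that VDscore aggregates per-task results via the Analytic Hierarchy Process (AHP) but gives no formulas. As context, here is a minimal sketch of standard AHP weighting applied to a composite benchmark score; the pairwise-comparison values, task count, and task scores below are illustrative assumptions, not the paper's actual settings.

```python
import numpy as np

def ahp_weights(pairwise: np.ndarray) -> np.ndarray:
    """Derive task weights from a reciprocal pairwise-comparison matrix
    via its principal eigenvector, as in standard AHP."""
    eigvals, eigvecs = np.linalg.eig(pairwise)
    principal = eigvecs[:, np.argmax(eigvals.real)].real
    weights = np.abs(principal)
    return weights / weights.sum()

# Hypothetical pairwise judgments over three task types
# (a_ij = importance of task i relative to task j; a_ji = 1 / a_ij).
pairwise = np.array([
    [1.0,  2.0, 4.0],
    [0.5,  1.0, 2.0],
    [0.25, 0.5, 1.0],
])
weights = ahp_weights(pairwise)  # ≈ [0.571, 0.286, 0.143]

# Made-up per-task scores, each normalized to [0, 1].
task_scores = np.array([0.72, 0.65, 0.58])
composite = float(weights @ task_scores)  # weighted aggregate, AHP-style
```

With a consistent comparison matrix like the one above, the weights are simply proportional to any column (here 4:2:1); for inconsistent judgments the eigenvector method averages them out, which is why AHP is a natural fit for combining heterogeneous task metrics into one score.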
Copyright | http://arxiv.org/licenses/nonexclusive-distrib/1.0 |
DOI | 10.48550/arxiv.2309.07387 |