MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models
Main Authors | Xia, Peng; Han, Siwei; Qiu, Shi; Zhou, Yiyang; Wang, Zhaoyang; Zheng, Wenhao; Chen, Zhaorun; Cui, Chenhang; Ding, Mingyu; Li, Linjie; Wang, Lijuan; Yao, Huaxiu |
---|---|
Format | Journal Article |
Language | English |
Published | 14.10.2024 |
Subjects | Computer Science - Computation and Language; Computer Science - Computer Vision and Pattern Recognition; Computer Science - Learning |
Online Access | https://arxiv.org/abs/2410.10139 |
Abstract | Interleaved multimodal comprehension and generation, enabling models to
produce and interpret both images and text in arbitrary sequences, has become
a pivotal area in multimodal learning. Despite significant advancements, the
evaluation of this capability remains insufficient. Existing benchmarks suffer
from limitations in data scale, scope, and evaluation depth, while current
evaluation metrics are often costly or biased, lacking reliability for
practical applications. To address these challenges, we introduce MMIE, a
large-scale knowledge-intensive benchmark for evaluating interleaved multimodal
comprehension and generation in Large Vision-Language Models (LVLMs). MMIE
comprises 20K meticulously curated multimodal queries, spanning 3 categories,
12 fields, and 102 subfields, including mathematics, coding, physics,
literature, health, and arts. It supports both interleaved inputs and outputs,
offering a mix of multiple-choice and open-ended question formats to evaluate
diverse competencies. Moreover, we propose a reliable automated evaluation
metric, leveraging a scoring model fine-tuned with human-annotated data and
systematic evaluation criteria, aimed at reducing bias and improving evaluation
accuracy. Extensive experiments demonstrate the effectiveness of our benchmark
and metrics in providing a comprehensive evaluation of interleaved LVLMs.
Specifically, we evaluate eight LVLMs, revealing that even the best models show
significant room for improvement, with most achieving only moderate results. We
believe MMIE will drive further advancements in the development of interleaved
LVLMs. We publicly release our benchmark and code at
https://mmie-bench.github.io/. |
Copyright | http://arxiv.org/licenses/nonexclusive-distrib/1.0 |
DOI | 10.48550/arxiv.2410.10139 |
DatabaseName | arXiv Computer Science arXiv.org |
IsDoiOpenAccess | true |
IsOpenAccess | true |
IsPeerReviewed | false |
IsScholarly | false |
OpenAccessLink | https://arxiv.org/abs/2410.10139 |
SecondaryResourceType | preprint |
SourceID | arxiv |
SourceType | Open Access Repository |
URI | https://arxiv.org/abs/2410.10139 |