MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models

Bibliographic Details
Main Authors: Xia, Peng; Han, Siwei; Qiu, Shi; Zhou, Yiyang; Wang, Zhaoyang; Zheng, Wenhao; Chen, Zhaorun; Cui, Chenhang; Ding, Mingyu; Li, Linjie; Wang, Lijuan; Yao, Huaxiu
Format: Journal Article
Language: English
Published: 14.10.2024
DOI: 10.48550/arXiv.2410.10139
Subjects: Computer Science - Computation and Language; Computer Science - Computer Vision and Pattern Recognition; Computer Science - Learning
Online Access: https://arxiv.org/abs/2410.10139

Abstract: Interleaved multimodal comprehension and generation, enabling models to produce and interpret both images and text in arbitrary sequences, have become a pivotal area in multimodal learning. Despite significant advancements, the evaluation of this capability remains insufficient. Existing benchmarks suffer from limitations in data scale, scope, and evaluation depth, while current evaluation metrics are often costly or biased, lacking in reliability for practical applications. To address these challenges, we introduce MMIE, a large-scale knowledge-intensive benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs). MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts. It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies. Moreover, we propose a reliable automated evaluation metric, leveraging a scoring model fine-tuned with human-annotated data and systematic evaluation criteria, aimed at reducing bias and improving evaluation accuracy. Extensive experiments demonstrate the effectiveness of our benchmark and metrics in providing a comprehensive evaluation of interleaved LVLMs. Specifically, we evaluate eight LVLMs, revealing that even the best models show significant room for improvement, with most achieving only moderate results. We believe MMIE will drive further advancements in the development of interleaved LVLMs. We publicly release our benchmark and code at https://mmie-bench.github.io/.
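The abstract's central notion of "interleaved" inputs and outputs is easiest to picture as an ordered sequence of typed segments, with a learned grader reserved for answers that cannot be checked mechanically. The Python sketch below illustrates that structure only; the dataclass fields, the query values, and the score_open_ended stub are hypothetical stand-ins for exposition, not the released MMIE schema or API (see https://mmie-bench.github.io/ for the actual benchmark and code).

    from dataclasses import dataclass
    from typing import Literal

    @dataclass
    class Segment:
        kind: Literal["text", "image"]    # each step is either text or an image reference
        content: str                      # raw text, or a path/URL to an image file

    @dataclass
    class InterleavedQuery:
        qid: str
        field_name: str                   # e.g. "mathematics", "coding", "physics"
        prompt: list[Segment]             # interleaved input: text and images in any order
        choices: list[str] | None = None  # set for multiple-choice items, None for open-ended

    def score_open_ended(query: InterleavedQuery, response: list[Segment]) -> float:
        """Stub marking where MMIE's automated metric would plug in: per the
        abstract, a scoring model fine-tuned on human-annotated data grades the
        interleaved response against systematic criteria."""
        raise NotImplementedError("load the released scoring model from the MMIE repo")

    # A hypothetical open-ended query whose input interleaves text and an image.
    q = InterleavedQuery(
        qid="demo-0001",
        field_name="physics",
        prompt=[
            Segment("text", "The diagram shows a two-pulley system at rest."),
            Segment("image", "figures/pulley.png"),
            Segment("text", "Explain, with a labeled sketch, how tension is distributed."),
        ],
    )
    print(len(q.prompt), "segments;", "multiple-choice" if q.choices else "open-ended")

Multiple-choice items can in principle be checked by matching against the listed choices, which is presumably why the fine-tuned scoring model targets the open-ended, interleaved answers.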