MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models
Main Authors | Xia, Peng; Han, Siwei; Qiu, Shi; Zhou, Yiyang; Wang, Zhaoyang; Zheng, Wenhao; Chen, Zhaorun; Cui, Chenhang; Ding, Mingyu; Li, Linjie; Wang, Lijuan; Yao, Huaxiu |
---|---|
Format | Journal Article |
Language | English |
Published | 14.10.2024 |
Subjects | Computer Science - Computation and Language; Computer Science - Computer Vision and Pattern Recognition; Computer Science - Learning |
Online Access | https://arxiv.org/abs/2410.10139 |
Abstract | Interleaved multimodal comprehension and generation, enabling models to
produce and interpret both images and text in arbitrary sequences, has become
a pivotal area in multimodal learning. Despite significant advancements, the
evaluation of this capability remains insufficient. Existing benchmarks suffer
from limitations in data scale, scope, and evaluation depth, while current
evaluation metrics are often costly or biased, lacking reliability for
practical applications. To address these challenges, we introduce MMIE, a
large-scale knowledge-intensive benchmark for evaluating interleaved multimodal
comprehension and generation in Large Vision-Language Models (LVLMs). MMIE
comprises 20K meticulously curated multimodal queries, spanning 3 categories,
12 fields, and 102 subfields, including mathematics, coding, physics,
literature, health, and arts. It supports both interleaved inputs and outputs,
offering a mix of multiple-choice and open-ended question formats to evaluate
diverse competencies. Moreover, we propose a reliable automated evaluation
metric, leveraging a scoring model fine-tuned with human-annotated data and
systematic evaluation criteria, aimed at reducing bias and improving evaluation
accuracy. Extensive experiments demonstrate the effectiveness of our benchmark
and metrics in providing a comprehensive evaluation of interleaved LVLMs.
Specifically, we evaluate eight LVLMs, revealing that even the best models show
significant room for improvement, with most achieving only moderate results. We
believe MMIE will drive further advancements in the development of interleaved
LVLMs. We publicly release our benchmark and code at
https://mmie-bench.github.io/. |
Copyright | http://arxiv.org/licenses/nonexclusive-distrib/1.0 |
DOI | 10.48550/arxiv.2410.10139 |
DatabaseName | arXiv Computer Science arXiv.org |
IsDoiOpenAccess | true |
IsOpenAccess | true |
IsPeerReviewed | false |
IsScholarly | false |
OpenAccessLink | https://arxiv.org/abs/2410.10139 |
SecondaryResourceType | preprint |
SourceID | arxiv |
SourceType | Open Access Repository |
URI | https://arxiv.org/abs/2410.10139 |