Generative Multimodal Models are In-Context Learners
Published in | Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online), pp. 14398-14409 |
---|---|
Main Authors | Sun, Quan; Cui, Yufeng; Zhang, Xiaosong; Zhang, Fan; Yu, Qiying; Wang, Yueze; Rao, Yongming; Liu, Jingjing; Huang, Tiejun; Wang, Xinlong |
Format | Conference Proceeding |
Language | English |
Published | IEEE, 16.06.2024 |
Subjects | Adaptation models; Benchmark testing; Codes; Computational modeling; Computer vision; Reviews; Visualization |
ISSN | 1063-6919 |
DOI | 10.1109/CVPR52733.2024.01365 |
Abstract | The human ability to easily solve multimodal tasks in context (i.e., with only a few demonstrations or simple instructions) is what current multimodal systems have largely struggled to imitate. In this work, we demonstrate that the task-agnostic in-context learning capabilities of large multimodal models can be significantly enhanced by effective scaling-up. We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences with a unified autoregressive objective. Emu2 exhibits strong multimodal in-context learning abilities, even emerging to solve tasks that require on-the-fly reasoning, such as visual prompting and object-grounded generation. The model sets a new record on multiple multimodal understanding tasks in few-shot settings. When instruction-tuned to follow specific instructions, Emu2 further achieves new state-of-the-art on challenging tasks such as question answering benchmarks for large multimodal models and open-ended subject-driven generation. These achievements demonstrate that Emu2 can serve as a base model and general-purpose interface for a wide range of multimodal tasks. Code and models are publicly available to facilitate future research. |
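The "unified autoregressive objective" over interleaved multimodal sequences mentioned in the abstract can be pictured as next-element prediction: a cross-entropy term where the next element is a text token, and a regression term where it is a visual embedding. The sketch below is a minimal illustration under that assumption; the function name and the equal weighting of the two terms are hypothetical, not the paper's exact recipe.

```python
# Minimal sketch (assumptions, not Emu2's exact recipe): a unified
# autoregressive loss over an interleaved sequence, with cross-entropy
# at text positions and embedding regression at visual positions.
import torch
import torch.nn.functional as F


def unified_autoregressive_loss(hidden, text_logits, text_targets,
                                visual_targets, is_text):
    """hidden:         (seq, dim)   predicted states for the next position
    text_logits:    (seq, vocab) language-model-head outputs
    text_targets:   (seq,)       next-token ids (only read at text positions)
    visual_targets: (seq, dim)   ground-truth next visual embeddings
    is_text:        (seq,)       bool mask, True where the next element is text
    """
    ce = F.cross_entropy(text_logits[is_text], text_targets[is_text])
    reg = F.mse_loss(hidden[~is_text], visual_targets[~is_text])
    return ce + reg  # equal weighting is an illustrative choice


# Toy shapes only, to show the call signature.
seq, dim, vocab = 8, 16, 100
loss = unified_autoregressive_loss(
    hidden=torch.randn(seq, dim),
    text_logits=torch.randn(seq, vocab),
    text_targets=torch.randint(vocab, (seq,)),
    visual_targets=torch.randn(seq, dim),
    is_text=torch.tensor([True, False, True, False, True, False, True, False]),
)
```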
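The multimodal in-context learning the abstract highlights amounts to interleaving a few (image, answer) demonstrations with a query in a single sequence and letting the model continue it autoregressively. Below is a hypothetical sketch of that prompt construction; the real Emu2 interface is in the authors' released code, and `ImageRef` and `build_fewshot_prompt` here are illustrative stand-ins.

```python
# Hypothetical sketch of multimodal few-shot prompt construction; the
# actual Emu2 API (in the authors' released code) will differ.
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class ImageRef:
    """Stand-in for an image slot in an interleaved multimodal sequence."""
    path: str


Prompt = List[Union[str, ImageRef]]


def build_fewshot_prompt(demos: List[Tuple[str, str]],
                         query_image: str) -> Prompt:
    """Interleave (image, answer) demonstrations with a query image.

    The demonstrations establish the task pattern in context; the model
    is expected to continue the sequence after the final "Answer:".
    """
    prompt: Prompt = []
    for image_path, answer in demos:
        prompt += [ImageRef(image_path), f"Answer: {answer}\n"]
    prompt += [ImageRef(query_image), "Answer:"]
    return prompt


# Two demonstrations are enough to specify, e.g., a counting task in context.
prompt = build_fewshot_prompt(
    demos=[("apples.jpg", "3 apples"), ("dogs.jpg", "2 dogs")],
    query_image="cats.jpg",
)
for part in prompt:
    print(part)
```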
Author | Sun, Quan (Beijing Academy of Artificial Intelligence); Cui, Yufeng (Beijing Academy of Artificial Intelligence); Zhang, Xiaosong (Beijing Academy of Artificial Intelligence); Zhang, Fan (Beijing Academy of Artificial Intelligence); Yu, Qiying (Tsinghua University); Wang, Yueze (Beijing Academy of Artificial Intelligence); Rao, Yongming (Beijing Academy of Artificial Intelligence); Liu, Jingjing (Tsinghua University); Huang, Tiejun (Beijing Academy of Artificial Intelligence); Wang, Xinlong (Beijing Academy of Artificial Intelligence, wangxinlong@baai.ac.cn) |
CODEN | IEEPAD |
ContentType | Conference Proceeding |
DOI | 10.1109/CVPR52733.2024.01365 |
Discipline | Applied Sciences |
EISBN | 9798350353006 |
EISSN | 1063-6919 |
EndPage | 14409 |
ExternalDocumentID | 10655928 |
Genre | orig-research |
IsPeerReviewed | false |
IsScholarly | true |
Language | English |
PageCount | 12 |
PublicationDate | 2024-06-16 |
PublicationTitle | Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online) |
PublicationTitleAbbrev | CVPR |
PublicationYear | 2024 |
Publisher | IEEE |
StartPage | 14398 |
SubjectTerms | Adaptation models Benchmark testing Codes Computational modeling Computer vision Reviews Visualization |
URI | https://ieeexplore.ieee.org/document/10655928 |