Generative Multimodal Models are In-Context Learners

Bibliographic Details
Published in Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online), pp. 14398 - 14409
Main Authors Sun, Quan; Cui, Yufeng; Zhang, Xiaosong; Zhang, Fan; Yu, Qiying; Wang, Yueze; Rao, Yongming; Liu, Jingjing; Huang, Tiejun; Wang, Xinlong
Format Conference Proceeding
Language English
Published IEEE 16.06.2024
ISSN 1063-6919
DOI 10.1109/CVPR52733.2024.01365

Abstract The human ability to easily solve multimodal tasks in context (i.e., with only a few demonstrations or simple instructions) is what current multimodal systems have largely struggled to imitate. In this work, we demonstrate that the task-agnostic in-context learning capabilities of large multimodal models can be significantly enhanced by effective scaling-up. We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences with a unified autoregressive objective. Emu2 exhibits strong multimodal in-context learning abilities, even emerging to solve tasks that require on-the-fly reasoning, such as visual prompting and object-grounded generation. The model sets a new record on multiple multimodal understanding tasks in few-shot settings. When instruction-tuned to follow specific instructions, Emu2 further achieves new state-of-the-art results on challenging tasks such as question answering benchmarks for large multimodal models and open-ended subject-driven generation. These achievements demonstrate that Emu2 can serve as a base model and general-purpose interface for a wide range of multimodal tasks. Code and models are publicly available to facilitate future research.
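
To make the few-shot setting in the abstract concrete, the following is a minimal, self-contained Python sketch of multimodal in-context prompting: demonstration (image, answer) pairs and a final query image are interleaved into a single sequence, and an autoregressive multimodal model is expected to continue the pattern. Every name in it (Image, build_prompt, the stubbed model call) is a hypothetical stand-in for illustration; it does not use the actual Emu2 interface.

    # Minimal sketch of multimodal in-context (few-shot) prompting, the setting
    # described in the abstract. Everything here (Image, build_prompt, and the
    # stand-in for the model call) is illustrative, not the actual Emu2 API.
    from dataclasses import dataclass
    from typing import List, Tuple, Union

    @dataclass
    class Image:
        path: str  # placeholder for real pixel data

    Segment = Union[str, Image]  # a prompt is an interleaved image/text sequence

    def build_prompt(demos: List[Tuple[Image, str]], query: Image) -> List[Segment]:
        """Interleave (image, answer) demonstrations, then append the unanswered
        query; an autoregressive multimodal model continues the pattern."""
        prompt: List[Segment] = []
        for image, answer in demos:
            prompt += [image, answer]
        prompt.append(query)
        return prompt

    # Two demonstrations alone define the task (fine-grained classification);
    # no task-specific instruction or fine-tuning is involved.
    demos = [
        (Image("husky.jpg"), "This is a Siberian husky."),
        (Image("malamute.jpg"), "This is an Alaskan malamute."),
    ]
    prompt = build_prompt(demos, Image("samoyed.jpg"))
    print(prompt)  # in practice: model.generate(prompt) with a real multimodal LM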
Author Wang, Xinlong
Liu, Jingjing
Zhang, Xiaosong
Yu, Qiying
Sun, Quan
Rao, Yongming
Huang, Tiejun
Cui, Yufeng
Wang, Yueze
Zhang, Fan
Author_xml – sequence: 1
  givenname: Quan
  surname: Sun
  fullname: Sun, Quan
  organization: Beijing Academy of Artificial Intelligence
– sequence: 2
  givenname: Yufeng
  surname: Cui
  fullname: Cui, Yufeng
  organization: Beijing Academy of Artificial Intelligence
– sequence: 3
  givenname: Xiaosong
  surname: Zhang
  fullname: Zhang, Xiaosong
  organization: Beijing Academy of Artificial Intelligence
– sequence: 4
  givenname: Fan
  surname: Zhang
  fullname: Zhang, Fan
  organization: Beijing Academy of Artificial Intelligence
– sequence: 5
  givenname: Qiying
  surname: Yu
  fullname: Yu, Qiying
  organization: Tsinghua University
– sequence: 6
  givenname: Yueze
  surname: Wang
  fullname: Wang, Yueze
  organization: Beijing Academy of Artificial Intelligence
– sequence: 7
  givenname: Yongming
  surname: Rao
  fullname: Rao, Yongming
  organization: Beijing Academy of Artificial Intelligence
– sequence: 8
  givenname: Jingjing
  surname: Liu
  fullname: Liu, Jingjing
  organization: Tsinghua University
– sequence: 9
  givenname: Tiejun
  surname: Huang
  fullname: Huang, Tiejun
  organization: Beijing Academy of Artificial Intelligence
– sequence: 10
  givenname: Xinlong
  surname: Wang
  fullname: Wang, Xinlong
  email: wangxinlong@baai.ac.cn
  organization: Beijing Academy of Artificial Intelligence
CODEN IEEPAD
ContentType Conference Proceeding
DOI 10.1109/CVPR52733.2024.01365
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan (POP) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP) 1998-present
Discipline Applied Sciences
EISBN 9798350353006
EISSN 1063-6919
EndPage 14409
ExternalDocumentID 10655928
Genre orig-research
IsPeerReviewed false
IsScholarly true
Language English
PageCount 12
PublicationCentury 2000
PublicationDate 2024-June-16
PublicationDateYYYYMMDD 2024-06-16
PublicationDate_xml – month: 06
  year: 2024
  text: 2024-June-16
  day: 16
PublicationDecade 2020
PublicationTitle Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online)
PublicationTitleAbbrev CVPR
PublicationYear 2024
Publisher IEEE
StartPage 14398
SubjectTerms Adaptation models
Benchmark testing
Codes
Computational modeling
Computer vision
Reviews
Visualization
Title Generative Multimodal Models are In-Context Learners
URI https://ieeexplore.ieee.org/document/10655928