EVA-02: A visual representation for neon genesis

Bibliographic Details
Published in: Image and Vision Computing, Vol. 149, p. 105171
Main Authors: Fang, Yuxin; Sun, Quan; Wang, Xinggang; Huang, Tiejun; Wang, Xinlong; Cao, Yue
Format: Journal Article
Language: English
Published: Elsevier B.V., 01.09.2024

Summary: We launch EVA-02, a next-generation Transformer-based visual representation pre-trained to reconstruct strong and robust language-aligned vision features via masked image modeling. With an updated plain Transformer architecture as well as extensive pre-training from an open & accessible giant CLIP vision encoder, EVA-02 demonstrates superior performance compared to prior state-of-the-art approaches across various representative vision tasks, while utilizing significantly fewer parameters and compute budgets. Notably, using exclusively publicly accessible training data, EVA-02 with only 304M parameters achieves a phenomenal 90.0 fine-tuning top-1 accuracy on the ImageNet-1K val set. Additionally, our EVA-02-CLIP can reach up to 80.4 zero-shot top-1 on ImageNet-1K, outperforming the previous largest & best open-sourced CLIP with only ∼1/6 the parameters and ∼1/6 the image-text training data. We offer four EVA-02 variants in various model sizes, ranging from 6M to 304M parameters, all with impressive performance. To facilitate open access and open research, we release the complete suite of EVA-02 to the community at https://github.com/baaivision/EVA/tree/master/EVA-02.

Highlights:
• EVA-02, a plain Transformer-based visual representation, demonstrates superior performance in various vision tasks.
• EVA-02 reduces model size through robust optimization, advanced activation functions, and position embedding.
• EVA-02 achieves 90.0 fine-tuning top-1 accuracy on ImageNet-1K with only 304M parameters.
• EVA-02-CLIP outperforms the best open-sourced CLIP in zero-shot ImageNet-1K classification, using less training data.
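The abstract's core recipe, masked image modeling against the features of a frozen CLIP vision encoder, can be sketched in a few lines of PyTorch. The sketch below is an illustrative assumption of how such a student/teacher setup fits together; names such as MIMPretrainer, the toy stand-in encoders, and the cosine feature-regression loss are hypothetical choices, not the authors' code, whose actual implementation lives in the repository linked above.

```python
# Minimal sketch of CLIP-feature masked image modeling (MIM):
# a plain ViT student regresses the features of a frozen CLIP
# vision encoder at masked patch positions. All names here are
# illustrative assumptions, not the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MIMPretrainer(nn.Module):
    def __init__(self, student: nn.Module, teacher: nn.Module,
                 dim: int, teacher_dim: int):
        super().__init__()
        self.student = student      # plain ViT-style encoder being pre-trained
        self.teacher = teacher      # frozen CLIP vision encoder (feature target)
        for p in self.teacher.parameters():
            p.requires_grad_(False)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Project student features into the teacher's feature space.
        self.head = nn.Linear(dim, teacher_dim)

    def forward(self, patches: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # patches: (B, N, dim) patch embeddings; mask: (B, N) bool, True = masked.
        # Replace masked patch embeddings with a learnable mask token.
        x = torch.where(mask.unsqueeze(-1), self.mask_token.to(patches.dtype), patches)
        pred = self.head(self.student(x))       # (B, N, teacher_dim)
        with torch.no_grad():
            target = self.teacher(patches)      # teacher sees the unmasked input
        # Regress the teacher's features at masked positions; a negative-cosine
        # (normalized) loss is one common choice for feature distillation.
        pred = F.normalize(pred[mask], dim=-1)
        target = F.normalize(target[mask], dim=-1)
        return (1.0 - (pred * target).sum(dim=-1)).mean()

# Toy usage with stand-in encoders (shapes only; not real CLIP weights):
if __name__ == "__main__":
    B, N, D, TD = 2, 196, 768, 1024
    layer = nn.TransformerEncoderLayer(d_model=D, nhead=12, batch_first=True)
    student = nn.TransformerEncoder(layer, num_layers=2)
    teacher = nn.Linear(D, TD)                  # stand-in for a CLIP vision tower
    model = MIMPretrainer(student, teacher, dim=D, teacher_dim=TD)
    loss = model(torch.randn(B, N, D), torch.rand(B, N) < 0.4)
    print(float(loss))
```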
ISSN: 0262-8856
DOI: 10.1016/j.imavis.2024.105171