EVA-02: A visual representation for neon genesis

Bibliographic Details
Published in: Image and Vision Computing, Vol. 149, p. 105171
Main Authors: Fang, Yuxin; Sun, Quan; Wang, Xinggang; Huang, Tiejun; Wang, Xinlong; Cao, Yue
Format: Journal Article
Language: English
Published: Elsevier B.V., 01.09.2024

Summary: We launch EVA-02, a next-generation Transformer-based visual representation pre-trained to reconstruct strong and robust language-aligned vision features via masked image modeling. With an updated plain Transformer architecture as well as extensive pre-training from an open & accessible giant CLIP vision encoder, EVA-02 demonstrates superior performance compared to prior state-of-the-art approaches across various representative vision tasks, while utilizing significantly fewer parameters and compute budgets. Notably, using exclusively publicly accessible training data, EVA-02 with only 304M parameters achieves a phenomenal 90.0 fine-tuning top-1 accuracy on the ImageNet-1K val set. Additionally, our EVA-02-CLIP can reach up to 80.4 zero-shot top-1 on ImageNet-1K, outperforming the previous largest & best open-sourced CLIP with only ∼1/6 the parameters and ∼1/6 the image-text training data. We offer four EVA-02 variants in various model sizes, ranging from 6M to 304M parameters, all with impressive performance. To facilitate open access and open research, we release the complete suite of EVA-02 to the community at https://github.com/baaivision/EVA/tree/master/EVA-02.

Highlights:
• EVA-02, a plain Transformer-based visual representation, demonstrates superior performance in various vision tasks.
• EVA-02 reduces model size through robust optimization, advanced activation functions, and position embedding.
• EVA-02 achieves 90.0 fine-tuning top-1 accuracy on ImageNet-1K with only 304M parameters.
• EVA-02-CLIP outperforms the best open-sourced CLIP in zero-shot ImageNet-1K classification, using less training data.
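The abstract's core recipe, masked image modeling against the features of a frozen CLIP vision encoder, can be sketched in a few lines of PyTorch. The sketch below is an illustrative assumption of how such a student/teacher setup fits together; names such as MIMPretrainer, the toy stand-in encoders, and the cosine feature-regression loss are hypothetical choices, not the authors' code, whose actual implementation lives in the repository linked above.

```python
# Minimal sketch of CLIP-feature masked image modeling (MIM):
# a plain ViT student regresses the features of a frozen CLIP
# vision encoder at masked patch positions. All names here are
# illustrative assumptions, not the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MIMPretrainer(nn.Module):
    def __init__(self, student: nn.Module, teacher: nn.Module,
                 dim: int, teacher_dim: int):
        super().__init__()
        self.student = student      # plain ViT-style encoder being pre-trained
        self.teacher = teacher      # frozen CLIP vision encoder (feature target)
        for p in self.teacher.parameters():
            p.requires_grad_(False)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Project student features into the teacher's feature space.
        self.head = nn.Linear(dim, teacher_dim)

    def forward(self, patches: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # patches: (B, N, dim) patch embeddings; mask: (B, N) bool, True = masked.
        # Replace masked patch embeddings with a learnable mask token.
        x = torch.where(mask.unsqueeze(-1), self.mask_token.to(patches.dtype), patches)
        pred = self.head(self.student(x))       # (B, N, teacher_dim)
        with torch.no_grad():
            target = self.teacher(patches)      # teacher sees the unmasked input
        # Regress the teacher's features at masked positions; a negative-cosine
        # (normalized) loss is one common choice for feature distillation.
        pred = F.normalize(pred[mask], dim=-1)
        target = F.normalize(target[mask], dim=-1)
        return (1.0 - (pred * target).sum(dim=-1)).mean()

# Toy usage with stand-in encoders (shapes only; not real CLIP weights):
if __name__ == "__main__":
    B, N, D, TD = 2, 196, 768, 1024
    layer = nn.TransformerEncoderLayer(d_model=D, nhead=12, batch_first=True)
    student = nn.TransformerEncoder(layer, num_layers=2)
    teacher = nn.Linear(D, TD)                  # stand-in for a CLIP vision tower
    model = MIMPretrainer(student, teacher, dim=D, teacher_dim=TD)
    loss = model(torch.randn(B, N, D), torch.rand(B, N) < 0.4)
    print(float(loss))
```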
ISSN: 0262-8856
DOI: 10.1016/j.imavis.2024.105171