Image as a Foreign Language: BEIT Pretraining for Vision and Vision-Language Tasks

Bibliographic Details
Published in: Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online), pp. 19175 - 19186
Main Authors: Wang, Wenhui; Bao, Hangbo; Dong, Li; Bjorck, Johan; Peng, Zhiliang; Liu, Qiang; Aggarwal, Kriti; Mohammed, Owais Khan; Singhal, Saksham; Som, Subhojit; Wei, Furu
Format: Conference Proceeding
Language: English
Published: IEEE, 01.01.2023
ISSN: 1063-6919
DOI: 10.1109/CVPR52729.2023.01838

Summary: A big convergence of language, vision, and multimodal pretraining is emerging. In this work, we introduce a general-purpose multimodal foundation model BEIT-3, which achieves excellent transfer performance on both vision and vision-language tasks. Specifically, we advance the big convergence from three aspects: backbone architecture, pretraining task, and model scaling up. We use Multiway Transformers for general-purpose modeling, where the modular architecture enables both deep fusion and modality-specific encoding. Based on the shared backbone, we perform masked "language" modeling on images (Imglish), texts (English), and image-text pairs ("parallel sentences") in a unified manner. Experimental results show that BEIT-3 obtains remarkable performance on object detection (COCO), semantic segmentation (ADE20K), image classification (ImageNet), visual reasoning (NLVR2), visual question answering (VQAv2), image captioning (COCO), and cross-modal retrieval (Flickr30K, COCO).
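
Note: The summary describes a Multiway Transformer backbone in which tokens share a self-attention module but are routed to modality-specific feed-forward "experts" (vision vs. language). The following is a minimal, hypothetical Python/PyTorch sketch of that routing idea only; the layer sizes, module names, and routing interface are illustrative assumptions and do not reproduce the authors' implementation.

import torch
import torch.nn as nn

class MultiwayBlock(nn.Module):
    # Illustrative Multiway Transformer block: shared self-attention followed by
    # modality-specific feed-forward experts. Dimensions are placeholders.
    def __init__(self, dim=768, num_heads=12, ffn_dim=3072):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # One feed-forward expert per modality; the attention weights are shared.
        self.experts = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(dim, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, dim))
            for name in ("vision", "language")
        })

    def forward(self, x, modality):
        # Shared self-attention over the token sequence.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        # Route through the feed-forward expert that matches the token modality.
        return x + self.experts[modality](self.norm2(x))

# Usage sketch: image patches and text tokens pass through the same attention
# parameters but different feed-forward experts.
block = MultiwayBlock()
img_tokens = torch.randn(2, 196, 768)   # e.g. 14x14 patch embeddings
txt_tokens = torch.randn(2, 32, 768)    # e.g. subword embeddings
v = block(img_tokens, "vision")
t = block(txt_tokens, "language")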