CAPIVARA: Cost-Efficient Approach for Improving Multilingual CLIP Performance on Low-Resource Languages
Main Authors | , , , , , , , , , , , |
---|---|
Format | Journal Article |
Language | English |
Published | 20.10.2023 |
Subjects | |
Summary: | This work introduces CAPIVARA, a cost-efficient framework designed to enhance
the performance of multilingual CLIP models in low-resource languages. While
CLIP has excelled in zero-shot vision-language tasks, the resource-intensive
nature of model training remains challenging. Many datasets lack linguistic
diversity, featuring solely English descriptions for images. CAPIVARA addresses
this by augmenting text data using image captioning and machine translation to
generate multiple synthetic captions in low-resource languages. We optimize the
training pipeline with LiT, LoRA, and gradient checkpointing to alleviate the
computational cost. Through extensive experiments, CAPIVARA emerges as state of
the art in zero-shot tasks involving images and Portuguese texts. We show the
potential for significant improvements in other low-resource languages,
achieved by fine-tuning the pre-trained multilingual CLIP using CAPIVARA on a
single GPU for 2 hours. Our model and code are available at
https://github.com/hiaac-nlp/CAPIVARA. |
---|---|
DOI: | 10.48550/arxiv.2310.13683 |
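
The summary names three ingredients without showing how they fit together: a locked image tower (LiT), low-rank adapters on the text side (LoRA), and a CLIP-style contrastive objective. The sketch below is a minimal NumPy illustration of that combination under stated assumptions, not the authors' implementation. The function names (`lora_linear`, `clip_contrastive_loss`), the dimensions, the scaling factor `alpha`, the temperature, and the random stand-in features are all hypothetical; the actual training code is in the repository linked in the summary.

```python
import numpy as np

def lora_linear(x, W, A, B, alpha=8.0):
    """Apply a frozen weight W plus a low-rank update (alpha / r) * B @ A.

    Only A and B would be trained; W stays fixed.
    """
    r = A.shape[0]
    return x @ (W + (alpha / r) * (B @ A)).T

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric cross-entropy over cosine-similarity logits of matched pairs."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature

    def xent(l):
        # Row-wise log-softmax, then pick the matched (diagonal) entries.
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
batch, d_in, d_out, rank = 4, 16, 8, 2

# Stand-in for the output of a frozen image encoder (the "locked" tower in LiT).
img_emb = rng.normal(size=(batch, d_out))
# Stand-in for text features of one caption per image (original or synthetic).
txt_feat = rng.normal(size=(batch, d_in))

W = rng.normal(size=(d_out, d_in))        # frozen text projection (assumed shape)
A = rng.normal(size=(rank, d_in)) * 0.01  # trainable low-rank factor
B = np.zeros((d_out, rank))               # trainable, zero-initialised

txt_emb = lora_linear(txt_feat, W, A, B)
loss = clip_contrastive_loss(img_emb, txt_emb)
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, so fine-tuning begins from the pre-trained model's behaviour. In this toy setting the trainable parameters are `A` and `B` only, which is what keeps the memory and compute footprint small relative to updating `W`.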