CAPIVARA: Cost-Efficient Approach for Improving Multilingual CLIP Performance on Low-Resource Languages

Bibliographic Details
Main Authors	Santos, Gabriel Oliveira dos; Moreira, Diego A. B.; Ferreira, Alef Iury; Silva, Jhessica; Pereira, Luiz; Bueno, Pedro; Sousa, Thiago; Maia, Helena; Da Silva, Nádia; Colombini, Esther; Pedrini, Helio; Avila, Sandra
Format	Journal Article
Language	English
Published	20.10.2023
Subjects	Computer Science - Learning

Summary:	This work introduces CAPIVARA, a cost-efficient framework designed to enhance the performance of multilingual CLIP models in low-resource languages. While CLIP has excelled in zero-shot vision-language tasks, the resource-intensive nature of model training remains challenging. Many datasets lack linguistic diversity, featuring solely English descriptions for images. CAPIVARA addresses this by augmenting text data using image captioning and machine translation to generate multiple synthetic captions in low-resource languages. We optimize the training pipeline with LiT, LoRA, and gradient checkpointing to alleviate the computational cost. Through extensive experiments, CAPIVARA emerges as state of the art in zero-shot tasks involving images and Portuguese texts. We show the potential for significant improvements in other low-resource languages, achieved by fine-tuning the pre-trained multilingual CLIP using CAPIVARA on a single GPU for 2 hours. Our model and code are available at https://github.com/hiaac-nlp/CAPIVARA.
DOI:	10.48550/arxiv.2310.13683
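
Illustrative sketch (not taken from the CAPIVARA repository): the summary describes augmenting each image's English caption with synthetic captions and translating them into a low-resource language. The Python below is one possible reading of that step under stated assumptions. The function names `build_caption_pool`, `sample_caption`, and `toy_translate`, the toy dictionary that stands in for a machine translation system, and the policy of sampling one caption per training step are all hypothetical and are not confirmed by the record above.

```python
import random
from typing import Callable, Dict, List


def build_caption_pool(
    original_caption: str,
    synthetic_captions: List[str],
    translate: Callable[[str], str],
) -> List[str]:
    """Translate the original and synthetic English captions into the target language."""
    english = [original_caption] + list(synthetic_captions)
    # Drop exact duplicates while preserving order, then translate each caption.
    unique = list(dict.fromkeys(english))
    return [translate(caption) for caption in unique]


def sample_caption(pool: List[str], rng: random.Random) -> str:
    """Pick one caption for an image (an assumed sampling policy, not the paper's)."""
    return rng.choice(pool)


# Toy stand-in for a machine translation system (hypothetical, illustration only).
TOY_EN_PT: Dict[str, str] = {
    "a capybara by the river": "uma capivara à beira do rio",
    "a large rodent near water": "um grande roedor perto da água",
}


def toy_translate(text: str) -> str:
    # Fall back to the untranslated text when the toy dictionary has no entry.
    return TOY_EN_PT.get(text, text)


pool = build_caption_pool(
    "a capybara by the river",
    ["a large rodent near water", "a capybara by the river"],
    toy_translate,
)
rng = random.Random(0)
caption = sample_caption(pool, rng)
```

In a real pipeline the synthetic captions would come from an image captioning model and `translate` from a machine translation model; the resulting pool would then supply the text side of each image-text pair during fine-tuning.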