VL-Few: Vision Language Alignment for Multimodal Few-Shot Meta Learning
Published in | Applied Sciences, Vol. 14, No. 3, p. 1169 |
---|---|
Main Authors | , , , |
Format | Journal Article |
Language | English |
Published | Basel: MDPI AG, 01.01.2024 |
Subjects | |
Summary | Complex tasks in the real world, such as visual question answering (VQA), involve models of different modalities. However, traditional multimodal learning requires a large amount of aligned data, such as image-text pairs, and constructing training data at that scale is a challenge. We therefore propose VL-Few, a simple and effective method for the multimodal few-shot problem. VL-Few (1) proposes modal alignment, which aligns visual features into the language space through a lightweight network and improves the model's multimodal understanding; (2) adopts few-shot meta learning for the multimodal problem, constructing a few-shot meta task pool to improve the model's generalization; (3) proposes semantic alignment to enhance the model's semantic understanding of the task, context, and demonstrations; (4) proposes task alignment, which casts the training data into the target task form and improves the model's task understanding; (5) proposes generation alignment, which adopts token-level training and a multitask fusion loss to improve the model's generation ability. Our experimental results show the effectiveness of VL-Few on multimodal few-shot problems. |
ISSN | 2076-3417 |
DOI | 10.3390/app14031169 |
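
The summary describes modal alignment as projecting visual features into the language space through a lightweight network. The sketch below illustrates that general idea in PyTorch; the module name, feature dimensions, and two-layer MLP design are assumptions for illustration only, not the paper's actual architecture.

```python
# Minimal sketch of a lightweight visual-to-language projector, assuming a
# frozen vision encoder and a frozen language model (details not taken from the paper).
import torch
import torch.nn as nn

class ModalAlignment(nn.Module):
    def __init__(self, vision_dim: int = 768, language_dim: int = 1024, hidden_dim: int = 512):
        super().__init__()
        # Lightweight projector: far smaller than the vision/language backbones it connects.
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, language_dim),
        )

    def forward(self, visual_features: torch.Tensor) -> torch.Tensor:
        # visual_features: (batch, num_patches, vision_dim) from the vision encoder.
        # Returns (batch, num_patches, language_dim) tokens that can be prepended
        # to the language model's input embeddings.
        return self.projector(visual_features)

# Example: project 196 patch features into the language embedding space.
vision_tokens = torch.randn(2, 196, 768)
aligned = ModalAlignment()(vision_tokens)  # -> torch.Size([2, 196, 1024])
```

Point (5) of the summary names token-level training with a multitask fusion loss. One hedged reading of that phrase is a weighted sum of per-task, token-level cross-entropy losses; the task names, weights, and function below are hypothetical and intended only to make the idea concrete.

```python
# Hedged sketch: fuse token-level losses from several tasks into one objective.
import torch
import torch.nn.functional as F

def multitask_fusion_loss(task_logits: dict, task_labels: dict, weights: dict) -> torch.Tensor:
    """Weighted sum of per-task token-level cross-entropy losses.

    task_logits[name]: (batch, seq_len, vocab_size) predictions for that task.
    task_labels[name]: (batch, seq_len) target token ids, -100 where ignored.
    weights[name]:     scalar weight for fusing the task into the total loss.
    """
    total = torch.zeros(())
    for name, logits in task_logits.items():
        labels = task_labels[name]
        # Token-level loss: every non-ignored target token contributes.
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            labels.reshape(-1),
            ignore_index=-100,
        )
        total = total + weights[name] * loss
    return total

# Example with two hypothetical tasks sharing a 32k-token vocabulary.
logits = {"vqa": torch.randn(2, 8, 32000), "caption": torch.randn(2, 8, 32000)}
labels = {"vqa": torch.randint(0, 32000, (2, 8)), "caption": torch.randint(0, 32000, (2, 8))}
print(multitask_fusion_loss(logits, labels, {"vqa": 1.0, "caption": 0.5}))
```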