Vintern-1B: An Efficient Multimodal Large Language Model for Vietnamese

Bibliographic Details
Published in: arXiv.org
Main Authors: Doan, Khang T; Huynh, Bao G; Hoang, Dung T; Pham, Thuc D; Pham, Nhat H; Nguyen, Quan T M; Vo, Bang Q; Hoang, Suong N
Format: Paper
Language: English
Published: Ithaca: Cornell University Library, arXiv.org, 23.08.2024

Summary: In this report, we introduce Vintern-1B, a reliable 1-billion-parameter multimodal large language model (MLLM) for Vietnamese language tasks. By integrating the Qwen2-0.5B-Instruct language model with the InternViT-300M-448px visual model, Vintern-1B is optimized for a range of applications, including optical character recognition (OCR), document extraction, and general question answering in a Vietnamese context. The model is fine-tuned on an extensive dataset of over 3 million image-question-answer pairs, achieving robust performance and reliable results across multiple Vietnamese language benchmarks such as OpenViVQA and ViTextVQA. Vintern-1B is small enough to fit easily into various on-device applications. Additionally, we have open-sourced several Vietnamese visual question answering (VQA) datasets for text and diagrams, created with Gemini 1.5 Flash. Our models are available at: https://huggingface.co/5CD-AI/Vintern-1B-v2.
ISSN: 2331-8422
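
Since the checkpoint is published on the Hugging Face Hub, it can in principle be loaded with the standard transformers Auto* classes. The sketch below is a minimal, unofficial example: it assumes the repository ships custom InternVL-style modeling code (hence trust_remote_code=True) and that bfloat16 weights fit on the target device; the exact image preprocessing and generation interface should be taken from the model card.

import torch
from transformers import AutoModel, AutoTokenizer

# Minimal sketch of loading Vintern-1B-v2 from the Hugging Face Hub.
# Assumes the repo provides custom modeling code (trust_remote_code=True);
# the image preprocessing (448x448 tiling for InternViT-300M-448px) and the
# chat/generate interface are defined by that remote code -- see the model card.
model_id = "5CD-AI/Vintern-1B-v2"

model = AutoModel.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # ~1B parameters, small enough for on-device use
    trust_remote_code=True,
).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Example Vietnamese OCR/VQA prompt; the "<image>" placeholder is the convention
# used by InternVL-family models to mark the image position in the prompt.
question = "<image>\nTrích xuất toàn bộ văn bản trong ảnh."  # "Extract all text in the image."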