COIG-CQIA: Quality is All You Need for Chinese Instruction Fine-tuning
Format: Journal Article
Language: English
Published: 26.03.2024
Summary: Remarkable progress on English instruction tuning has facilitated the
efficacy and reliability of large language models (LLMs). However, there
remains a noticeable gap in instruction tuning for Chinese, where complex
linguistic features pose significant challenges. Existing datasets, generally
distilled from English-centric LLMs, are not well aligned with Chinese users'
interaction patterns. To bridge this gap, we introduce COIG-CQIA, a new Chinese
instruction tuning dataset derived from various real-world resources and
subjected to rigorous human verification. We conduct extensive experiments on
COIG-CQIA and compare models trained on it against strong baseline models and
datasets. The experimental results show that models trained on COIG-CQIA
achieve highly competitive performance on diverse benchmarks. Additionally, our
findings offer several insights for designing effective Chinese
instruction-tuning datasets and data-mixing strategies. Our dataset is
available at https://huggingface.co/datasets/m-a-p/COIG-CQIA.
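For reference, a minimal sketch of loading the dataset with the Hugging Face `datasets` library. The dataset is hosted at the repository id above; the subset name "ruozhiba" used here is an assumption for illustration, so check the dataset card for the actual list of configurations.

```python
# Minimal sketch: loading COIG-CQIA via the Hugging Face `datasets` library.
# The configuration name "ruozhiba" is assumed for illustration; the dataset
# card lists the available source-specific subsets.
from datasets import load_dataset

# COIG-CQIA is organized into source-specific subsets, so a configuration
# name is passed alongside the repository id.
ds = load_dataset("m-a-p/COIG-CQIA", "ruozhiba", split="train")

# Inspect one instruction-tuning example.
print(ds[0])
```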
DOI: 10.48550/arxiv.2403.18058