COIG-CQIA: Quality is All You Need for Chinese Instruction Fine-tuning

Bibliographic Details
Main Authors: Bai, Yuelin; Du, Xinrun; Liang, Yiming; Jin, Yonggang; Zhou, Junting; Liu, Ziqiang; Fang, Feiteng; Chang, Mingshan; Zheng, Tianyu; Zhang, Xincheng; Ma, Nuo; Wang, Zekun; Yuan, Ruibin; Wu, Haihong; Lin, Hongquan; Huang, Wenhao; Zhang, Jiajun; Lin, Chenghua; Fu, Jie; Yang, Min; Ni, Shiwen; Zhang, Ge
Format: Journal Article
Language: English
Published: 26.03.2024

Summary: Remarkable progress on English instruction tuning has facilitated the efficacy and reliability of large language models (LLMs). However, there remains a noticeable gap in instruction tuning for Chinese, where complex linguistic features pose significant challenges. Existing datasets, generally distilled from English-centric LLMs, are not well aligned with Chinese users' interaction patterns. To bridge this gap, we introduce COIG-CQIA, a new Chinese instruction-tuning dataset derived from various real-world resources and subjected to rigorous human verification. We conduct extensive experiments on COIG-CQIA and compare models trained on it against strong baseline models and datasets. The experimental results show that models trained on COIG-CQIA achieve highly competitive performance across diverse benchmarks. Additionally, our findings offer several insights for designing effective Chinese instruction-tuning datasets and data-mixing strategies. Our dataset is available at https://huggingface.co/datasets/m-a-p/COIG-CQIA.
DOI: 10.48550/arxiv.2403.18058
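
The summary links the dataset on the Hugging Face Hub. Below is a minimal sketch of loading it with the standard Python "datasets" library; the record does not name the dataset's subset (config) names or splits, so both are queried or assumed rather than taken from the source:

    from datasets import get_dataset_config_names, load_dataset

    # COIG-CQIA is hosted at m-a-p/COIG-CQIA; list its named subsets
    # (configs) instead of hard-coding one, since none are given here.
    configs = get_dataset_config_names("m-a-p/COIG-CQIA")
    print(configs)

    # Load one subset and inspect a sample record. The "train" split
    # name is an assumption; check the dataset card for actual splits.
    ds = load_dataset("m-a-p/COIG-CQIA", configs[0], split="train")
    print(ds[0])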