C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models
Main Authors | |
---|---|
Format | Journal Article |
Language | English |
Published | 14.05.2023 |
Summary: New NLP benchmarks are urgently needed to align with the rapid development of large language models (LLMs). We present C-Eval, the first comprehensive Chinese evaluation suite designed to assess advanced knowledge and reasoning abilities of foundation models in a Chinese context. C-Eval comprises multiple-choice questions across four difficulty levels: middle school, high school, college, and professional. The questions span 52 diverse disciplines, ranging from humanities to science and engineering. C-Eval is accompanied by C-Eval Hard, a subset of very challenging subjects in C-Eval that requires advanced reasoning abilities to solve. We conduct a comprehensive evaluation of the most advanced LLMs on C-Eval, including both English- and Chinese-oriented models. Results indicate that only GPT-4 could achieve an average accuracy of over 60%, suggesting that there is still significant room for improvement for current LLMs. We anticipate C-Eval will help analyze important strengths and shortcomings of foundation models, and foster their development and growth for Chinese users.
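The abstract reports accuracy on multiple-choice questions, averaged over subjects. As a minimal sketch of that kind of scoring (not the official C-Eval harness), the snippet below computes per-subject and macro-averaged accuracy; the tuple format `(subject, predicted_label, gold_label)`, the function name `ceval_style_accuracy`, and the choice of macro averaging are all illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical item format: (subject, predicted_label, gold_label), where
# labels are choice letters such as "A"-"D". This is NOT the official
# C-Eval evaluation code, only an illustration of the scoring described.
def ceval_style_accuracy(predictions):
    """Return (per-subject accuracy dict, macro-averaged accuracy)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, pred, gold in predictions:
        total[subject] += 1
        if pred == gold:
            correct[subject] += 1
    per_subject = {s: correct[s] / total[s] for s in total}
    # Macro average: mean over subjects, so small disciplines count equally.
    average = sum(per_subject.values()) / len(per_subject)
    return per_subject, average

# Toy usage with made-up items:
preds = [
    ("high_school_physics", "B", "B"),
    ("high_school_physics", "C", "A"),
    ("law", "D", "D"),
]
per_subject, avg = ceval_style_accuracy(preds)
print(per_subject, avg)
```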
DOI: 10.48550/arxiv.2305.08322