INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge
Main Authors | |
---|---|
Format | Journal Article |
Language | English |
Published | 29.11.2024 |
Subjects | |
Online Access | Get full text |
Summary: | The performance differential of large language models (LLMs) between languages hinders their effective deployment in many regions, inhibiting the potential economic and societal value of generative AI tools in many communities. However, the development of functional LLMs in many languages (i.e., multilingual LLMs) is bottlenecked by the lack of high-quality evaluation resources in languages other than English. Moreover, current practices in multilingual benchmark construction often translate English resources, ignoring the regional and cultural knowledge of the environments in which multilingual systems would be used. In this work, we construct an evaluation suite of 197,243 QA pairs from local exam sources to measure the capabilities of multilingual LLMs in a variety of regional contexts. Our novel resource, INCLUDE, is a comprehensive knowledge- and reasoning-centric benchmark across 44 written languages that evaluates multilingual LLMs for performance in the actual language environments where they would be deployed. |
---|---|
DOI: | 10.48550/arxiv.2411.19799 |
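
For a concrete picture of how a knowledge-centric multilingual benchmark of this kind is typically scored, the following is a minimal sketch of per-language multiple-choice accuracy. The item schema and the `model_predict` stub are illustrative assumptions, not the paper's released data format or evaluation code.

```python
from collections import defaultdict

# Hypothetical items in the spirit of the benchmark described above: multiple-choice
# QA drawn from local exam sources, tagged by language. Field names are assumptions
# for illustration, not INCLUDE's actual schema.
items = [
    {"language": "Greek", "question": "…", "choices": ["A", "B", "C", "D"], "answer": 2},
    {"language": "Hindi", "question": "…", "choices": ["A", "B", "C", "D"], "answer": 0},
]

def model_predict(question, choices):
    """Placeholder for a real multilingual LLM call; here it always picks option 0."""
    return 0

# Score per language so that gaps tied to regional knowledge show up in the breakdown.
correct, total = defaultdict(int), defaultdict(int)
for item in items:
    pred = model_predict(item["question"], item["choices"])
    total[item["language"]] += 1
    correct[item["language"]] += int(pred == item["answer"])

for lang in sorted(total):
    print(f"{lang}: {correct[lang] / total[lang]:.1%} accuracy over {total[lang]} items")
```

Reporting accuracy per language rather than a single pooled score is what makes the cross-lingual performance differential described in the summary visible.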