INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge


Bibliographic Details
Published in: arXiv.org
Main Authors: Romanou, Angelika, Foroutan, Negar, Sotnikova, Anna, Chen, Zeming, Nelaturu, Sree Harsha, Singh, Shivalika, Maheshwary, Rishabh, Altomare, Micol, Haggag, Mohamed A, Snegha, A, Amayuelas, Alfonso, Azril Hafizi Amirudin, Aryabumi, Viraat, Boiko, Danylo, Chang, Michael, Chim, Jenny, Cohen, Gal, Dalmia, Aditya Kumar, Diress, Abraham, Duwal, Sharad, Dzenhaliou, Daniil, Daniel Fernando Erazo Florez, Farestam, Fabian, Imperial, Joseph Marvin, Shayekh Bin Islam, Isotalo, Perttu, Jabbarishiviari, Maral, Karlsson, Börje F, Khalilov, Eldar, Klamm, Christopher, Koto, Fajri, Krzemiński, Dominik, de Melo, Gabriel Adriano, Montariol, Syrielle, Yiyang Nan, Niklaus, Joel, Novikova, Jekaterina, Johan Samir Obando Ceron, Paul, Debjit, Ploeger, Esther, Purbey, Jebish, Rajwal, Swati, Selvan Sunitha Ravi, Rydell, Sara, Roshan Santhosh, Sharma, Drishti, Skenduli, Marjana Prifti, Arshia Soltani Moakhar, Bardia Soltani Moakhar, Tamir, Ran, Tarun, Ayush Kumar, Wasi, Azmine Toushik, Weerasinghe, Thenuka Ovin, Yilmaz, Serhan, Zhang, Mike, Schlag, Imanol, Fadaee, Marzieh, Hooker, Sara, Bosselut, Antoine
Format: Paper
Language: English
Published: Ithaca: Cornell University Library, arXiv.org, 29.11.2024

Summary: The performance differential of large language models (LLMs) between languages hinders their effective deployment in many regions, inhibiting the potential economic and societal value of generative AI tools in many communities. However, the development of functional LLMs in many languages (i.e., multilingual LLMs) is bottlenecked by the lack of high-quality evaluation resources in languages other than English. Moreover, current practices in multilingual benchmark construction often translate English resources, ignoring the regional and cultural knowledge of the environments in which multilingual systems would be used. In this work, we construct an evaluation suite of 197,243 QA pairs from local exam sources to measure the capabilities of multilingual LLMs in a variety of regional contexts. Our novel resource, INCLUDE, is a comprehensive knowledge- and reasoning-centric benchmark across 44 written languages that evaluates multilingual LLMs for performance in the actual language environments where they would be deployed.
ISSN:2331-8422
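To make the summary concrete: a benchmark of exam-style QA pairs across many languages is typically scored as per-language accuracy. The sketch below is a minimal, hypothetical illustration of that scoring loop; the record fields (`language`, `question`, `choices`, `answer`) and the baseline predictor are assumptions for illustration, not the paper's actual data schema or evaluation code.

```python
# Hypothetical sketch: per-language accuracy over multiple-choice QA records.
# Field names ('language', 'question', 'choices', 'answer') are assumed here,
# not taken from the INCLUDE paper's actual schema.
from collections import defaultdict

def accuracy_by_language(records, predict):
    """Compute accuracy per language.

    records: iterable of dicts with keys 'language', 'question',
             'choices' (list of answer strings), and 'answer'
             (index of the gold choice).
    predict: callable (question, choices) -> chosen index.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["language"]] += 1
        if predict(r["question"], r["choices"]) == r["answer"]:
            correct[r["language"]] += 1
    return {lang: correct[lang] / total[lang] for lang in total}

# Toy usage with a trivial always-pick-the-first-choice baseline:
sample = [
    {"language": "el", "question": "2+2=?", "choices": ["4", "5"], "answer": 0},
    {"language": "hi", "question": "2+3=?", "choices": ["4", "5"], "answer": 1},
]
scores = accuracy_by_language(sample, lambda q, c: 0)
print(scores)  # {'el': 1.0, 'hi': 0.0}
```

Reporting accuracy per language, rather than one pooled number, is what exposes the performance differential between languages that the summary describes.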