NLP Evaluation in trouble: On the Need to Measure LLM Data Contamination for each Benchmark

In this position paper, we argue that the classical evaluation of Natural Language Processing (NLP) tasks using annotated benchmarks is in trouble. The worst kind of data contamination happens when a Large Language Model (LLM) is trained on the test split of a benchmark and then evaluated on that same benchmark.
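To make the notion of contamination concrete, the following is a minimal sketch, not taken from the paper, of a naive check for test-split contamination based on n-gram overlap between a training corpus and a benchmark test set. All names here (NGRAM_SIZE, contamination_rate, the toy data) are illustrative assumptions; real contamination audits are considerably more involved.

from typing import Iterable, List, Set

NGRAM_SIZE = 8  # assumed window size; an actual audit would tune this

def ngrams(text: str, n: int = NGRAM_SIZE) -> Set[tuple]:
    # Build the set of token n-grams for a single document.
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(training_corpus: Iterable[str], test_examples: List[str]) -> float:
    # Fraction of test examples that share at least one n-gram with the training corpus.
    train_ngrams: Set[tuple] = set()
    for doc in training_corpus:
        train_ngrams |= ngrams(doc)
    contaminated = sum(1 for ex in test_examples if ngrams(ex) & train_ngrams)
    return contaminated / max(len(test_examples), 1)

if __name__ == "__main__":
    # Toy data: the first test example is verbatim in the training corpus, the second is not.
    train = ["the quick brown fox jumps over the lazy dog near the river bank"]
    test = [
        "the quick brown fox jumps over the lazy dog near the river bank",
        "an entirely different sentence with no overlap at all in any window",
    ]
    print(f"Contamination rate: {contamination_rate(train, test):.2f}")  # prints 0.50

On the toy data above, the verbatim test example is flagged as contaminated and the unrelated one is not, giving a rate of 0.50; in practice, detecting contamination in closed training corpora is far harder, which is part of the paper's argument.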

Bibliographic Details
Published in: arXiv.org
Main Authors: Sainz, Oscar; Campos, Jon Ander; García-Ferrero, Iker; Etxaniz, Julen; Lopez de Lacalle, Oier; Agirre, Eneko
Format: Paper
Language: English
Published: Ithaca: Cornell University Library, arXiv.org, 27.10.2023