NLP Evaluation in trouble: On the Need to Measure LLM Data Contamination for each Benchmark
In this position paper, we argue that the classical evaluation on Natural Language Processing (NLP) tasks using annotated benchmarks is in trouble. The worst kind of data contamination happens when a Large Language Model (LLM) is trained on the test split of a benchmark, and then evaluated in the sa...
Published in | arXiv.org |
---|---|
Main Authors | Oscar Sainz, Jon Ander Campos, Iker García-Ferrero, Julen Etxaniz, Oier Lopez de Lacalle, Eneko Agirre |
Format | Paper |
Language | English |
Published | Ithaca: Cornell University Library, arXiv.org, 27.10.2023 |
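Subjects | Benchmarks; Contamination; Large language models; Natural language processing; Position measurement |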
Abstract | In this position paper, we argue that the classical evaluation on Natural Language Processing (NLP) tasks using annotated benchmarks is in trouble. The worst kind of data contamination happens when a Large Language Model (LLM) is trained on the test split of a benchmark, and then evaluated in the same benchmark. The extent of the problem is unknown, as it is not straightforward to measure. Contamination causes an overestimation of the performance of a contaminated model in a target benchmark and associated task with respect to their non-contaminated counterparts. The consequences can be very harmful, with wrong scientific conclusions being published while other correct ones are discarded. This position paper defines different levels of data contamination and argues for a community effort, including the development of automatic and semi-automatic measures to detect when data from a benchmark was exposed to a model, and suggestions for flagging papers with conclusions that are compromised by data contamination. |
---|---|
Author | García-Ferrero, Iker; Oier Lopez de Lacalle; Sainz, Oscar; Etxaniz, Julen; Agirre, Eneko; Jon Ander Campos |
Author_xml | 1. Sainz, Oscar; 2. Jon Ander Campos; 3. García-Ferrero, Iker; 4. Etxaniz, Julen; 5. Oier Lopez de Lacalle; 6. Agirre, Eneko |
ContentType | Paper |
Copyright | 2023. This work is published under http://creativecommons.org/licenses/by-sa/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License. |
Discipline | Physics |
EISSN | 2331-8422 |
Genre | Working Paper/Pre-Print |
IsOpenAccess | true |
IsPeerReviewed | false |
IsScholarly | false |
Language | English |
OpenAccessLink | https://www.proquest.com/docview/2883992248/abstract/?pq-origsite=%requestingapplication% |
PublicationCentury | 2000 |
PublicationDate | 2023-10-27 |
PublicationDecade | 2020 |
PublicationPlace | Ithaca |
PublicationTitle | arXiv.org |
PublicationYear | 2023 |
Publisher | Cornell University Library, arXiv.org |
SecondaryResourceType | preprint |
SubjectTerms | Benchmarks; Contamination; Large language models; Natural language processing; Position measurement |
Title | NLP Evaluation in trouble: On the Need to Measure LLM Data Contamination for each Benchmark |
URI | https://www.proquest.com/docview/2883992248/abstract/ |