A survey of 25 years of evaluation

Bibliographic Details
Published in: Natural Language Engineering, Vol. 25, No. 6, pp. 753-767
Main Authors: Church, Kenneth Ward; Hestness, Joel
Format: Journal Article
Language: English
Published: Cambridge, UK: Cambridge University Press, 01.11.2019

Summary: Evaluation was not a thing when the first author was a graduate student in the late 1970s. There was an Artificial Intelligence (AI) boom then, but that boom was quickly followed by a bust and a long AI Winter. Charles Wayne restarted funding in the mid-1980s by emphasizing evaluation. No other sort of program could have been funded at the time, at least in America. His program was so successful that these days, shared tasks and leaderboards have become commonplace in speech and language (and in Vision and Machine Learning). It is hard to remember that evaluation was a tough sell 25 years ago. That said, we may be a bit too satisfied with the current state of the art. This paper surveys considerations from other fields, such as reliability and validity from psychology and generalization from systems. There has been a trend for publications to report better and better numbers, but what do these numbers mean? Sometimes the numbers are too good to be true, and sometimes the truth is better than the numbers. It is one thing for an evaluation to fail to find a difference between man and machine, and quite another thing to pass the Turing Test. As Feynman said, "the first principle is that you must not fool yourself–and you are the easiest person to fool."
ISSN: 1351-3249
eISSN: 1469-8110
DOI: 10.1017/S1351324919000275