Natural Questions: A Benchmark for Question Answering Research

Bibliographic Details
Published in: Transactions of the Association for Computational Linguistics, Vol. 7, pp. 453–466
Main Authors: Kwiatkowski, Tom; Palomaki, Jennimaria; Redfield, Olivia; Collins, Michael; Parikh, Ankur; Alberti, Chris; Epstein, Danielle; Polosukhin, Illia; Devlin, Jacob; Lee, Kenton; Toutanova, Kristina; Jones, Llion; Kelcey, Matthew; Chang, Ming-Wei; Dai, Andrew M.; Uszkoreit, Jakob; Le, Quoc; Petrov, Slav
Format: Journal Article
Language: English
Published: Cambridge, MA, USA: MIT Press, 01.11.2019

Summary: We present the Natural Questions corpus, a question answering data set. Questions consist of real, anonymized, aggregated queries issued to the Google search engine. An annotator is presented with a question along with a Wikipedia page from the top 5 search results, and annotates a long answer (typically a paragraph) and a short answer (one or more entities) if present on the page, or marks null if no long/short answer is present. The public release consists of 307,373 training examples with single annotations; 7,830 examples with 5-way annotations for development data; and a further 7,842 examples with 5-way annotations sequestered as test data. We present experiments validating the quality of the data. We also describe an analysis of 25-way annotations on 302 examples, giving insights into human variability on the annotation task. We introduce robust metrics for the purposes of evaluating question answering systems; demonstrate high human upper bounds on these metrics; and establish baseline results using competitive methods drawn from related literature.
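
The summary describes each example as a question paired with a Wikipedia page, plus per-annotator long-answer and short-answer spans that may be null. Below is a minimal Python sketch of what such an annotated example might look like; the class and field names (NQExample, LongAnswer, start_token, and so on) are illustrative assumptions for this record and do not reproduce the exact schema of the public data release.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class LongAnswer:
    # Token span of the annotated long answer (typically a paragraph);
    # start_token == -1 marks a null long answer for this annotator.
    start_token: int
    end_token: int

@dataclass
class ShortAnswer:
    # Token span of one short-answer entity within the page text.
    start_token: int
    end_token: int

@dataclass
class Annotation:
    long_answer: LongAnswer
    short_answers: List[ShortAnswer] = field(default_factory=list)

@dataclass
class NQExample:
    example_id: int
    question_text: str
    document_tokens: List[str]
    # One annotation per training example; five per dev/test example.
    annotations: List[Annotation] = field(default_factory=list)

def has_long_answer(example: NQExample) -> bool:
    """True if at least one annotator marked a non-null long answer."""
    return any(a.long_answer.start_token != -1 for a in example.annotations)

# Illustrative example: one annotator gives a paragraph-level long answer
# and a single-entity short answer ("1962").
ex = NQExample(
    example_id=1,
    question_text="when was the mit press founded",
    document_tokens="The MIT Press is a university press founded in 1962 .".split(),
    annotations=[
        Annotation(
            long_answer=LongAnswer(start_token=0, end_token=11),
            short_answers=[ShortAnswer(start_token=9, end_token=10)],
        )
    ],
)
print(has_long_answer(ex))  # True
```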
ISSN: 2307-387X
DOI: 10.1162/tacl_a_00276