Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies

Bibliographic Details
Published in	Transactions of the Association for Computational Linguistics, Vol. 9, pp. 346-361
Main Authors	Geva, Mor; Khashabi, Daniel; Segal, Elad; Khot, Tushar; Roth, Dan; Berant, Jonathan
Format	Journal Article
Language	English
Published	MIT Press, One Rogers Street, Cambridge, MA 02142-1209, USA, 01.01.2021
Subjects	Benchmarks; Computational linguistics; Creativity; Crowdsourcing; Data collection; Decomposition; Helium; Linguistics; Paragraphs; Priming; Question answer sequences; Questions; Reasoning
Summary:	A key limitation in current datasets for multi-hop reasoning is that the required steps for answering the question are mentioned in it explicitly. In this work, we introduce StrategyQA, a question answering (QA) benchmark where the required reasoning steps are implicit in the question, and should be inferred using a strategy. A fundamental challenge in this setup is how to elicit such creative questions from crowdsourcing workers, while covering a broad range of potential strategies. We propose a data collection procedure that combines term-based priming to inspire annotators, careful control over the annotator population, and adversarial filtering for eliminating reasoning shortcuts. Moreover, we annotate each question with (1) a decomposition into reasoning steps for answering it, and (2) Wikipedia paragraphs that contain the answers to each step. Overall, StrategyQA includes 2,780 examples, each consisting of a strategy question, its decomposition, and evidence paragraphs. Analysis shows that questions in StrategyQA are short, topic-diverse, and cover a wide range of strategies. Empirically, we show that humans perform well (87%) on this task, while our best baseline reaches an accuracy of ∼66%.
ISSN:	2307-387X
DOI:	10.1162/tacl_a_00370