PAQ: 65 Million Probably-Asked Questions and What You Can Do With Them

Open-domain Question Answering models that directly leverage question-answer (QA) pairs, such as closed-book QA (CBQA) models and QA-pair retrievers, show promise in terms of speed and memory compared with conventional models which retrieve and read from text corpora. QA-pair retrievers also offer i...

Full description

Saved in:

Bibliographic Details
Published in	Transactions of the Association for Computational Linguistics Vol. 9; pp. 1098 - 1115
Main Authors	Lewis, Patrick, Wu, Yuxiang, Liu, Linqing, Minervini, Pasquale, Küttler, Heinrich, Piktus, Aleksandra, Stenetorp, Pontus, Riedel, Sebastian
Format	Journal Article
Language	English
Published	One Rogers Street, Cambridge, MA 02142-1209, USA MIT Press 07.10.2021 MIT Press Journals, The The MIT Press
Subjects	Accuracy Computational linguistics Question answer sequences Questions Testing time
Online Access	Get full text
ISSN	2307-387X 2307-387X
DOI	10.1162/tacl_a_00415

Cover

Loading…

More Information
Summary:	Open-domain Question Answering models that directly leverage question-answer (QA) pairs, such as closed-book QA (CBQA) models and QA-pair retrievers, show promise in terms of speed and memory compared with conventional models which retrieve and read from text corpora. QA-pair retrievers also offer interpretable answers, a high degree of control, and are trivial to update at test time with new knowledge. However, these models fall short of the accuracy of retrieve-and-read systems, as substantially less knowledge is covered by the available QA-pairs relative to text corpora like Wikipedia. To facilitate improved QA-pair models, we introduce (PAQ), a very large resource of 65M automatically generated QA-pairs. We introduce a new QA-pair retriever, RePAQ, to complement PAQ. We find that PAQ and test questions, enabling RePAQ to match the accuracy of recent retrieve-and-read models, whilst being significantly faster. Using PAQ, we train CBQA models which outperform comparable baselines by 5%, but trail RePAQ by over 15%, indicating the effectiveness of explicit retrieval. RePAQ can be configured for size (under 500MB) or speed (over 1K questions per second) while retaining high accuracy. Lastly, we demonstrate RePAQ’s strength at , abstaining from answering when it is likely to be incorrect. This enables RePAQ to “back-off” to a more expensive state-of-the-art model, leading to a combined system which is both more accurate and 2x faster than the state-of-the-art model alone.
Bibliography:	2021 ObjectType-Article-1 SourceType-Scholarly Journals-1 ObjectType-Feature-2 content type line 14
ISSN:	2307-387X 2307-387X
DOI:	10.1162/tacl_a_00415