The TechQA Dataset
Format | Journal Article |
---|---|
Language | English |
Published | 07.11.2019 |
Summary: | We introduce TechQA, a domain-adaptation question answering dataset for the
technical support domain. The TechQA corpus highlights two real-world issues
from the automated customer support domain. First, it contains actual questions
posed by users on a technical forum, rather than questions generated
specifically for a competition or a task. Second, it has a real-world size:
600 training, 310 dev, and 490 evaluation question/answer pairs, thus
reflecting the cost of creating large labeled datasets with actual data.
Consequently, TechQA is meant to stimulate research in domain adaptation rather
than being a resource to build QA systems from scratch. The dataset was
obtained by crawling the IBM Developer and IBM DeveloperWorks forums for
questions with accepted answers that appear in a published IBM Technote, a
technical document that addresses a specific technical issue. We also release a
collection of the 801,998 publicly available Technotes as of April 4, 2019 as a
companion resource that might be used for pretraining, to learn representations
of the IT domain language. |
---|---|
DOI: | 10.48550/arxiv.1911.02984 |