Fewer topics? A million topics? Both?! On topics subsets in test collections

Bibliographic Details
Published in	Information retrieval (Boston), Vol. 23, no. 1, pp. 49-85
Main Authors	Roitero, Kevin; Culpepper, J. Shane; Sanderson, Mark; Scholer, Falk; Mizzaro, Stefano
Format	Journal Article
Language	English
Published	Dordrecht: Springer Netherlands, 01.02.2020; Springer Nature B.V.
Subjects	Clustering; Collection; Collections; Computer Science; Data Mining and Knowledge Discovery; Data Structures and Information Theory; Effectiveness; Evaluation; Information retrieval; Information Storage and Retrieval; Natural Language Processing (NLP); Pattern Recognition; Retrieval performance measures; Statistical significance; Retrieval evaluation; Topic clustering; Few topics

Summary:	When evaluating IR run effectiveness using a test collection, a key question is: What search topics should be used? We explore what happens to measurement accuracy when the number of topics in a test collection is reduced, using the Million Query 2007, TeraByte 2006, and Robust 2004 TREC collections, which all feature more than 50 topics, something that has not been examined in past work. Our analysis finds that a subset of topics can be found that is as accurate as the full topic set at ranking runs. Further, we show that the size of the subset, relative to the full topic set, can be substantially smaller than was shown in past work. We also study the topic subsets in the context of the power of statistical significance tests. We find that there is a trade-off in using such sets in that significant results may be missed, but the loss of statistical significance is much smaller than when selecting random subsets. We also find topic subsets that can result in a low-accuracy test collection, even when the number of queries in the subset is quite large. These negatively correlated subsets suggest we still lack good methodologies that provide stability guarantees on topic selection in new collections. Finally, we examine whether clustering of topics is an appropriate strategy to find and characterize good topic subsets. Our results contribute to the understanding of information retrieval effectiveness evaluation, and offer insights for the construction of test collections.
ISSN:	1386-4564 1573-7659
DOI:	10.1007/s10791-019-09357-w