Sample Sizes for Query Probing in Uncooperative Distributed Information Retrieval

Bibliographic Details
Published in: Frontiers of WWW Research and Development - APWeb 2006, pp. 63-75
Main Authors: Shokouhi, Milad; Scholer, Falk; Zobel, Justin
Format: Book Chapter, Conference Proceeding
Language: English
Published: Berlin, Heidelberg, Springer Berlin Heidelberg, 2006
Springer
Series: Lecture Notes in Computer Science
Summary: The goal of distributed information retrieval is to support effective searching over multiple document collections. For efficiency, queries should be routed to only those collections that are likely to contain relevant documents, so it is necessary to first obtain information about the content of the target collections. In an uncooperative environment, query probing, where randomly chosen queries are used to retrieve a sample of the documents and thus of the lexicon, has been proposed as a technique for estimating statistical term distributions. In this paper we rebut the claim that a sample of 300 documents is sufficient to provide good coverage of collection terms. We propose a novel sampling strategy and experimentally demonstrate that sample size needs to vary from collection to collection, that our methods achieve good coverage based on variable-sized samples, and that we can use the results of a probe to determine when to stop sampling.
ISBN: 3540311424, 9783540311423
ISSN: 0302-9743, 1611-3349
DOI: 10.1007/11610113_7