Indexing Shared Content in Information Retrieval Systems

Bibliographic Details
Published in	Advances in Database Technology - EDBT 2006, pp. 313-330
Main Authors	Broder, Andrei Z.; Eiron, Nadav; Fontoura, Marcus; Herscovici, Michael; Lempel, Ronny; McPherson, John; Qi, Runping; Shekita, Eugene
Format	Book Chapter, Conference Proceeding
Language	English
Published	Berlin, Heidelberg: Springer Berlin Heidelberg, 2006
Series	Lecture Notes in Computer Science
Subjects	Applied sciences | Computer science; control theory; systems | Exact sciences and technology | Information Retrieval System | Information systems. Data bases | Memory organisation. Data processing | Query Evaluation | Query Performance | Query Term | Representation Model | Software | Electronic discussion group | Inverted file | Database query | Information system | Information retrieval | Text | Electronic mail | Modeling | Document retrieval system | World wide web | Document structure | Database | Content-based retrieval | Content analysis | Indexing | Execution time

More Information
Summary:	Modern document collections often contain groups of documents with overlapping or shared content. However, most information retrieval systems process each document separately, causing shared content to be indexed multiple times. In this paper, we describe a new document representation model where related documents are organized as a tree, allowing shared content to be indexed just once. We show how this representation model can be encoded in an inverted index and we describe algorithms for evaluating free-text queries based on this encoding. We also show how our representation model applies to web, email, and newsgroup search. Finally, we present experimental results showing that our methods can provide a significant reduction in the size of an inverted index as well as in the time to build and query it.
ISBN:	3540329609, 9783540329602
ISSN:	0302-9743, 1611-3349
DOI:	10.1007/11687238_21