Self-indexing Based on LZ77

Bibliographic Details
Published in	Combinatorial Pattern Matching, pp. 41–54
Main Authors	Kreft, Sebastian; Navarro, Gonzalo
Format	Book Chapter
Language	English
Published	Berlin, Heidelberg: Springer Berlin Heidelberg
Series	Lecture Notes in Computer Science
Subjects	Binary Search; Phrase Boundary; Reverse Trie; Software Repository; Text Collection

Summary:	We introduce the first self-index based on the Lempel-Ziv 1977 compression format (LZ77). It is particularly competitive for highly repetitive text collections such as sequence databases of genomes of related species, software repositories, versioned document collections, and temporal text databases. Such collections are extremely compressible but classical self-indexes fail to capture that source of compressibility. Our self-index takes in practice a few times the space of the text compressed with LZ77 (as little as 2.5 times), extracts 1–2 million characters of the text per second, and finds patterns at a rate of 10–50 microseconds per occurrence. It is smaller (up to one half) than the best current self-index for repetitive collections, and faster in many cases.
Bibliography:	Partially funded by the Millennium Institute for Cell Dynamics and Biotechnology (ICDB), Grant ICM P05-001-F, Mideplan, Chile, and, for the first author, by Conicyt’s Master Scholarship.
ISBN:	9783642214578, 3642214576
ISSN:	0302-9743, 1611-3349
DOI:	10.1007/978-3-642-21458-5_6
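
Note:	The summary's claim that repetitive collections are "extremely compressible" under LZ77 can be illustrated with a small sketch. The code below is not the chapter's self-index and is not taken from it; it is a naive, quadratic-time greedy LZ77 parse written only for illustration, and the names `lz77_parse` and `lz77_decode` are hypothetical. It assumes the common formulation in which each phrase copies the longest earlier-starting match and then adds one explicit character.

```python
def lz77_parse(text):
    """Naive greedy LZ77 parse (illustrative assumption, not the paper's code).

    Each phrase is (source, length, next_char): copy `length` characters
    starting at position `source` (which must start before the phrase, but
    may overlap it), then append the explicit character `next_char`.
    A phrase with no earlier match has source -1 and length 0.
    """
    phrases = []
    i, n = 0, len(text)
    while i < n:
        best_len, best_src = 0, -1
        for j in range(i):  # candidate source start, strictly before i
            length = 0
            # Stop at n - 1 so one character remains for next_char.
            while i + length < n - 1 and text[j + length] == text[i + length]:
                length += 1
            if length > best_len:
                best_len, best_src = length, j
        phrases.append((best_src, best_len, text[i + best_len]))
        i += best_len + 1
    return phrases


def lz77_decode(phrases):
    """Rebuild the text from the phrases produced by lz77_parse."""
    out = []
    for src, length, ch in phrases:
        for k in range(length):
            # Character-by-character copy, so self-overlapping sources work.
            out.append(out[src + k])
        out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    repetitive = "the quick brown fox " * 20  # 400 characters, highly repetitive
    parsed = lz77_parse(repetitive)
    print(len(repetitive), "characters ->", len(parsed), "phrases")
    assert lz77_decode(parsed) == repetitive
```

On such input the number of phrases stays close to the number needed for a single copy of the repeated unit, which is the kind of redundancy the summary says classical self-indexes fail to capture.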