Computing MEMs and Relatives on Repetitive Text Collections

Bibliographic Details
Published in: ACM Transactions on Algorithms, Vol. 21, no. 1, pp. 1-33
Main Author: Navarro, Gonzalo
Format: Journal Article
Language: English
Published: New York, NY: ACM, 17.12.2024
Summary: We consider the problem of computing the Maximal Exact Matches (MEMs) of a given pattern \(P[1\mathinner{..}m]\) on a large repetitive text collection \(T[1\mathinner{..}n]\) over an alphabet of size \(\sigma\), which is represented as a (hopefully much smaller) run-length context-free grammar of size \(g_{rl}\). We show that the problem can be solved in time \(O(m^{2}\log^{\epsilon}n)\), for any constant \(\epsilon > 0\), on a data structure of size \(O(g_{rl})\). Further, on a locally consistent grammar of size \(O(\delta\log\frac{n\log\sigma}{\delta\log n})\), the time decreases to \(O(m\log m(\log m+\log^{\epsilon}n))\). The value \(\delta\) is a function of the substring complexity of \(T\) and \(\Omega(\delta\log\frac{n\log\sigma}{\delta\log n})\) is a tight lower bound on the compressibility of repetitive texts \(T\), so our structure has optimal size in terms of \(n\), \(\sigma\), and \(\delta\). We extend our results to several related problems, such as finding \(k\)-MEMs, MUMs, rare MEMs, and applications.
Categories and Subject Descriptors: E.1 [Data structures]; E.2 [Data storage representations]; E.4 [Coding and information theory]: Data compaction and compression; F.2.2 [Analysis of algorithms and problem complexity]: Nonnumerical algorithms and problems - Pattern matching, Computations on discrete structures, Sorting and searching.
ISSN: 1549-6325, 1549-6333
DOI: 10.1145/3701561
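
Illustration: The summary defines the problem but not what a MEM looks like on a concrete input. The sketch below is a brute-force illustration of the usual definition only: a MEM is a substring of \(P\) that occurs in \(T\) and cannot be extended by one symbol to the left or right within \(P\) and still occur in \(T\). It is not the grammar-based method of the article. It works on the plain, uncompressed text and is far slower than the bounds quoted in the summary. The function names `longest_match_len` and `naive_mems` are hypothetical, and positions are 0-based, whereas the summary indexes from 1.

```python
def longest_match_len(pattern: str, text: str, i: int) -> int:
    """Length of the longest prefix of pattern[i:] that occurs somewhere in text."""
    length = 0
    # Grow the candidate one symbol at a time while it still occurs in text.
    while i + length < len(pattern) and pattern[i:i + length + 1] in text:
        length += 1
    return length


def naive_mems(pattern: str, text: str) -> list[tuple[int, int]]:
    """Return the MEMs of pattern in text as (start, length) pairs, 0-based.

    Assumption for illustration: brute force over the plain text, not the
    compressed-space data structure described in the summary.
    """
    mems = []
    prev_end = 0  # furthest end reached by the longest match of any earlier start
    for i in range(len(pattern)):
        length = longest_match_len(pattern, text, i)
        end = i + length
        # The match starting at i is right-maximal by construction. It is also
        # left-maximal exactly when it ends beyond every earlier match.
        if length > 0 and end > prev_end:
            mems.append((i, length))
        prev_end = max(prev_end, end)
    return mems


if __name__ == "__main__":
    print(naive_mems("abcxbcd", "abcbcd"))  # [(0, 3), (4, 3)]
```

In this example, `abc` and `bcd` are the two MEMs of the pattern `abcxbcd`: `x` does not occur in the text, so neither match can be extended across it.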