Effloc: An Efficient Locating Algorithm for Mass-Occurrence Biological Patterns with FM-Index

Pattern locating is a crucial step in various biological sequence analysis tasks. As a compressed full-text indexing technology, full-text minute-space index has been introduced for biological pattern locating over ultra-long genomes, with a low memory footprint and retrieving time independent of ge...

Full description

Saved in:
Bibliographic Details
Published inJournal of computational biology Vol. 32; no. 9; pp. 865 - 878
Main Author GUO, LI-LU
Format Journal Article
LanguageEnglish
Published United States Mary Ann Liebert, Inc., publishers 01.09.2025
Subjects
Online AccessGet full text
ISSN1557-8666
1557-8666
DOI10.1089/cmb.2024.0925

Cover

Loading…
More Information
Summary:Pattern locating is a crucial step in various biological sequence analysis tasks. As a compressed full-text indexing technology, full-text minute-space index has been introduced for biological pattern locating over ultra-long genomes, with a low memory footprint and retrieving time independent of genome size. However, its locating time is limited by the number of occurrences of the biological pattern in the genome, and it is not efficient enough when dealing with mass-occurrence biological patterns. To solve this problem, we propose an efficient locating algorithm for mass-occurrence biological patterns in genomic sequence, namely Effloc. It is developed on two optimization techniques. One is that rankings with the same Burrows–Wheeler Transform character are organized into a group and calculated together, thereby reducing the number of last-to-first column ( LF ) mapping operations required to jump forward to find suffix array (SA) sampling points; the other is to design a specific structure to record the jump status, thus avoiding the redundant LF mapping operations that exist in the process of finding SA sampling points for those adjacent patterns that share the same sampling point. Compared with the existing algorithm, Effloc can significantly reduce the number of time-consuming LF mapping operations in mass-occurrence pattern locating. Ablation experiments verified our algorithm’s effectiveness, exhibiting faster locating speed compared with five state-of-the-art competing algorithms. The source code and data are released at https://github.com/Lilu-guo/Effloc .
Bibliography:ObjectType-Article-1
SourceType-Scholarly Journals-1
ObjectType-Feature-2
content type line 23
ISSN:1557-8666
1557-8666
DOI:10.1089/cmb.2024.0925