Effloc: An Efficient Locating Algorithm for Mass-Occurrence Biological Patterns with FM-Index
Published in | Journal of computational biology Vol. 32; no. 9; pp. 865 - 878 |
---|---|
Format | Journal Article |
Language | English |
Published | United States, Mary Ann Liebert, Inc., publishers, 01.09.2025 |
ISSN | 1557-8666 |
DOI | 10.1089/cmb.2024.0925 |
Summary: | Pattern locating is a crucial step in various biological sequence analysis tasks. As a compressed full-text indexing technology, the full-text minute-space index (FM-index) has been introduced for biological pattern locating over ultra-long genomes, with a low memory footprint and a retrieval time independent of genome size. However, its locating time grows with the number of occurrences of the biological pattern in the genome, so it is not efficient enough for mass-occurrence biological patterns. To solve this problem, we propose Effloc, an efficient locating algorithm for mass-occurrence biological patterns in genomic sequences. It is built on two optimization techniques. First, rankings with the same Burrows–Wheeler Transform character are organized into a group and calculated together, reducing the number of last-to-first column (LF) mapping operations required to jump forward to suffix array (SA) sampling points. Second, a dedicated structure records the jump status, avoiding the redundant LF mapping operations that arise when adjacent patterns share the same SA sampling point. Compared with the existing algorithm, Effloc significantly reduces the number of time-consuming LF mapping operations in mass-occurrence pattern locating. Ablation experiments verified the effectiveness of the algorithm, and Effloc located patterns faster than five state-of-the-art competing algorithms. The source code and data are released at https://github.com/Lilu-guo/Effloc. |
---|---|
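
The summary attributes the locating cost to LF mapping steps taken from each occurrence until an SA sampling point is reached. The sketch below illustrates only that baseline FM-index locate procedure, not Effloc's grouping or jump-status structure. All names (`build_index`, `backward_search`, `locate`, `sample_rate`) are hypothetical and do not come from the Effloc repository, and the dense occurrence table is used for readability rather than compactness.

```python
def build_index(text, sample_rate=4):
    """Build a toy FM-index. Assumes '$' is absent from text and sorts first."""
    text += "$"
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])   # naive suffix array
    bwt = "".join(text[i - 1] for i in sa)          # text[-1] is '$' when i == 0
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    C, total = {}, 0                                # C[c]: chars smaller than c
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]
    occ = {ch: [0] * (n + 1) for ch in counts}      # occ[c][i]: count of c in bwt[:i]
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    # Keep SA values only at text positions divisible by sample_rate.
    samples = {row: pos for row, pos in enumerate(sa) if pos % sample_rate == 0}
    return bwt, C, occ, samples


def backward_search(pattern, C, occ, n):
    """Return the half-open SA row range [lo, hi) of suffixes prefixed by pattern."""
    lo, hi = 0, n
    for ch in reversed(pattern):
        if ch not in C:
            return 0, 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0, 0
    return lo, hi


def locate(pattern, index):
    """Return (sorted text positions, total number of LF mapping steps)."""
    bwt, C, occ, samples = index
    lo, hi = backward_search(pattern, C, occ, len(bwt))
    positions, lf_steps = [], 0
    for row in range(lo, hi):
        steps = 0
        while row not in samples:                   # walk until a sampled row
            ch = bwt[row]
            row = C[ch] + occ[ch][row]              # one LF mapping step
            steps += 1
        positions.append(samples[row] + steps)
        lf_steps += steps
    return sorted(positions), lf_steps


index = build_index("ACGTACGTACGA", sample_rate=4)
print(locate("CG", index))  # ([1, 5, 9], 3)
```

In this baseline every occurrence walks to its sampling point independently, so the number of LF steps scales with the number of occurrences. That per-occurrence cost is what the two optimizations in the summary aim to reduce.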