DH_Aligner: A fast short-read aligner on multicore platforms with AVX vectorization

Bibliographic Details
Published in: Journal of Parallel and Distributed Computing, Vol. 205, p. 105142
Main Authors: Qiao Sun, Feng Chen, Leisheng Li, Huiyuan Li
Format: Journal Article
Language: English
Published: Elsevier Inc., 01.11.2025

Summary: The rapid development of NGS (Next-Generation Sequencing) technology produces massive volumes of genome data at a much higher throughput than before, creating great demand for fast and accurate downstream genetic analysis. As one of the first steps of the bioinformatics workflow, read alignment makes an educated guess at where and how a read maps to a given reference sequence. In this paper, we propose DH_Aligner, a fast and accurate short-read aligner designed and optimized for x86 multi-core platforms with AVX2/AVX-512 SIMD instruction sets. It is based on a three-phase alignment workflow, seeding-filtering-extension, and provides an end-to-end solution for read alignment from FASTQ to SAM files. Thanks to a fast seeding scheme and a seed-filtering procedure, DH_Aligner avoids both a time-consuming seeding phase and the redundant work of aligning reads at apparently wrong locations. With the introduction of a batched-processing methodology, parallelism is readily exploited at the data, instruction, and thread levels. The performance-critical kernels in DH_Aligner are implemented with both AVX2 and AVX-512 intrinsics for better performance and portability. On two typical x86-based platforms, Intel Xeon-6154 and Hygon C86-7285, DH_Aligner achieves near-best accuracy/sensitivity while outperforming state-of-the-art parallel implementations, with average speedups of 7.8x, 3.4x, 2.8x-6.7x, and 1.5x over bwa-mem, bwa-mem2, bowtie2, and minimap2, respectively.

Highlights:
• Fast and accurate seeding schemes are devised on top of a BWT+FM-index.
• A batched-processing methodology helps exploit multiple types of parallelism.
• Core kernels are vectorized with AVX-512/AVX2 instructions and on-chip optimizations.
• Several times faster than state-of-the-art implementations with a negligible drop in accuracy.
ISSN: 0743-7315
DOI: 10.1016/j.jpdc.2025.105142
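
A note on the vectorization claim in the summary: the abstract states that DH_Aligner's performance-critical kernels are implemented with both AVX2 and AVX-512 intrinsics for performance and portability. The paper's actual kernels are not reproduced here; the following is only a minimal, hypothetical C sketch of that general pattern, a toy base-matching routine with AVX2 and AVX-512BW variants plus a scalar fallback, chosen at run time with GCC/Clang's __builtin_cpu_supports. The function names and the dispatch scheme are assumptions made for this illustration and are not taken from DH_Aligner.

/* Illustrative sketch only: multi-ISA kernel variants with run-time dispatch.
 * Requires GCC or Clang on x86-64; compile with plain `cc -O2 sketch.c`. */
#include <immintrin.h>
#include <stddef.h>
#include <stdio.h>

/* Scalar fallback: count positions where read and reference bases agree. */
static size_t count_matches_scalar(const char *read, const char *ref, size_t n)
{
    size_t hits = 0;
    for (size_t i = 0; i < n; ++i)
        hits += (read[i] == ref[i]);
    return hits;
}

/* AVX2 variant: compare 32 bases per iteration. */
__attribute__((target("avx2")))
static size_t count_matches_avx2(const char *read, const char *ref, size_t n)
{
    size_t hits = 0, i = 0;
    for (; i + 32 <= n; i += 32) {
        __m256i a  = _mm256_loadu_si256((const __m256i *)(read + i));
        __m256i b  = _mm256_loadu_si256((const __m256i *)(ref + i));
        __m256i eq = _mm256_cmpeq_epi8(a, b);
        hits += (size_t)__builtin_popcount((unsigned)_mm256_movemask_epi8(eq));
    }
    return hits + count_matches_scalar(read + i, ref + i, n - i);
}

/* AVX-512BW variant: compare 64 bases per iteration using a mask register. */
__attribute__((target("avx512f,avx512bw")))
static size_t count_matches_avx512(const char *read, const char *ref, size_t n)
{
    size_t hits = 0, i = 0;
    for (; i + 64 <= n; i += 64) {
        __m512i a = _mm512_loadu_si512((const void *)(read + i));
        __m512i b = _mm512_loadu_si512((const void *)(ref + i));
        hits += (size_t)__builtin_popcountll(_mm512_cmpeq_epi8_mask(a, b));
    }
    return hits + count_matches_scalar(read + i, ref + i, n - i);
}

/* Pick the widest vector path the host CPU supports. */
static size_t count_matches(const char *read, const char *ref, size_t n)
{
    if (__builtin_cpu_supports("avx512bw"))
        return count_matches_avx512(read, ref, n);
    if (__builtin_cpu_supports("avx2"))
        return count_matches_avx2(read, ref, n);
    return count_matches_scalar(read, ref, n);
}

int main(void)
{
    const char read[] = "ACGTACGTACGTACGTACGTACGTACGTACGTACGT";
    const char ref[]  = "ACGTACGAACGTACGTACGTTCGTACGTACGTACGT";
    printf("matching bases: %zu\n", count_matches(read, ref, sizeof read - 1));
    return 0;
}

In a production aligner the vectorized kernel would typically be something like a banded Smith-Waterman extension step rather than plain base matching, but keeping one AVX2 and one AVX-512 variant behind a run-time dispatcher is the same performance/portability pattern the abstract describes.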