DH_Aligner: A fast short-read aligner on multicore platforms with AVX vectorization
Published in | Journal of parallel and distributed computing Vol. 205; p. 105142 |
---|---|
Main Authors | , , , |
Format | Journal Article |
Language | English |
Published | Elsevier Inc, 01.11.2025 |
Subjects | |
Summary: | The rapid development of next-generation sequencing (NGS) technology produces massive amounts of genome data at a much higher throughput than before, creating great demand for fast and accurate downstream genetic analysis. As one of the first steps of a bioinformatics workflow, read alignment makes an educated guess about where and how a read maps to a given reference sequence. In this paper, we propose DH_Aligner, a fast and accurate short-read aligner designed and optimized for x86 multi-core platforms with AVX2/AVX-512 SIMD instruction sets. It is based on a three-phase alignment workflow (seeding, filtering, extension) and provides an end-to-end solution for read alignment from FASTQ to SAM files. Thanks to a fast seeding scheme and a seed-filtering procedure, DH_Aligner avoids both a time-consuming seeding phase and the redundant work of aligning reads at seemingly wrong locations. With a batched-processing methodology, parallelism is readily exploited at the data, instruction, and thread levels. The performance-critical kernels in DH_Aligner are implemented with both AVX2 and AVX-512 intrinsics for better performance and portability. On two typical x86-based platforms, Intel Xeon-6154 and Hygon C86-7285, DH_Aligner achieves near-best accuracy and sensitivity while outperforming state-of-the-art parallel implementations, with average speedups of 7.8x, 3.4x, 2.8x-6.7x and 1.5x over bwa-mem, bwa-mem2, bowtie2 and minimap2, respectively.
• Fast and accurate seeding schemes are devised based on BWT+FM-Index. • A batched-processing methodology helps exploit multiple types of parallelism. • Core kernels are vectorized with AVX-512/AVX2 instructions and on-chip optimizations. • Several times faster than state-of-the-art implementations, with a negligible drop in accuracy. |
---|---|
ISSN: | 0743-7315 |
DOI: | 10.1016/j.jpdc.2025.105142 |
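
Note on the seeding step: the highlights mention seeding based on BWT+FM-Index. For readers unfamiliar with the technique, the following is a minimal textbook sketch of exact-match backward search over an FM-index. It is not DH_Aligner's implementation: the function names `build_fm_index` and `backward_search` are hypothetical, the suffix array is built naively, and the occurrence table is stored in full, whereas production aligners typically use sampled tables and vectorized kernels.

```python
from collections import Counter


def build_fm_index(text):
    """Build a toy FM-index for `text` (hypothetical helper, illustration only).

    Returns the suffix array, the C table and the full Occ table.
    """
    text += "$"  # sentinel; assumed to sort before every other symbol
    # Naive suffix array: fine for a sketch, far too slow for a real genome.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the symbol preceding each sorted suffix (wraps to "$" for i == 0).
    bwt = "".join(text[i - 1] for i in sa)

    # c_table[c] = number of symbols in the text strictly smaller than c.
    counts = Counter(text)
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]

    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for sym in occ:
            occ[sym][i + 1] = occ[sym][i] + (1 if sym == ch else 0)
    return sa, c_table, occ


def backward_search(pattern, sa, c_table, occ):
    """Return sorted start positions of exact matches of `pattern`."""
    lo, hi = 0, len(sa)  # half-open interval over the suffix array
    for ch in reversed(pattern):  # extend the match one symbol to the left
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:  # interval became empty: no occurrence
            return []
    return sorted(sa[lo:hi])


# Usage: index a short reference and look up a seed.
sa, c_table, occ = build_fm_index("ACGTACGTTACG")
seed_hits = backward_search("ACG", sa, c_table, occ)  # [0, 4, 9]
```

Each pattern symbol narrows the suffix-array interval with two table lookups, so the search cost depends on the seed length rather than the reference length. That property is what makes FM-index lookups a common basis for the seeding phase of a seed-filter-extend aligner.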