CFSAN SNP Pipeline 2: a pipeline for fast and accurate SNP distance estimation from bacterial genome assemblies
Published in | PeerJ. Computer science Vol. 11; p. e2878 |
---|---|
Main Authors | , , , |
Format | Journal Article |
Language | English |
Published | PeerJ. Ltd 09.07.2025 PeerJ Inc |
Summary: | Accurate genetic distance estimation from pathogen whole-genome sequence data is critical for public health surveillance, and in food safety it provides crucial information for traceback and outbreak investigations. The computational demands on contemporary bioinformatics pipelines that extract high-resolution single nucleotide polymorphisms (SNPs) grow in parallel with the size of pathogen clusters: single strains of common pathogens such as Escherichia coli and Salmonella enterica can now contain hundreds or thousands of isolates. To facilitate rapid analysis of whole-genome sequencing (WGS) data for large clusters of foodborne bacterial pathogens, we introduce the CFSAN SNP Pipeline 2 (CSP2). CSP2 is a bioinformatics pipeline coded in Nextflow and Python that extracts SNPs directly from genome assemblies in seconds through rapid MUMmer whole-genome alignment and parallel processing. After genome alignment, most data processing steps mirror the quality control measures used in the CFSAN SNP Pipeline (CSP1), including density filtering and missing data handling. Analysis of simulated data finds that high-quality assemblies from the strategic K-mer extension for scrupulous assemblies (SKESA) contain sufficient information for accurate, high-resolution SNP distance estimation, while assemblies from the St. Petersburg genome assembler (SPAdes) contain more false positives. CSP2 SNP distances for 150 real-world clusters (50 each of E. coli, Listeria monocytogenes, and S. enterica) were highly correlated with those from CSP1 and the National Center for Biotechnology Information (NCBI) Pathogen Detection pipeline (E. coli r ≥ 0.98; Salmonella r = 0.99; Listeria r = 0.99). This evaluation of CSP2 demonstrates its comparability to accepted methods and validates its use in future traceback and outbreak investigations. |
---|---|
ISSN: | 2376-5992 |
DOI: | 10.7717/peerj-cs.2878 |
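
The summary mentions density filtering, missing data handling, and pairwise SNP distances. The following is a minimal Python sketch of those three ideas, not the CSP2 implementation. The function names, the `N` missing-data marker, the input layout (one dict of position to base call per isolate), and the window and threshold values are all assumptions made for illustration.

```python
from bisect import bisect_right
from itertools import combinations

MISSING = "N"  # hypothetical marker for a position with no usable base call


def density_filter(positions, window=1000, max_snps=3):
    """Drop SNP positions that fall in a window holding more than max_snps SNPs.

    The window and max_snps values are illustrative, not pipeline defaults.
    """
    ordered = sorted(set(positions))
    dense = set()
    for i, start in enumerate(ordered):
        # Index one past the last SNP inside [start, start + window - 1].
        end = bisect_right(ordered, start + window - 1)
        if end - i > max_snps:
            dense.update(ordered[i:end])
    return [p for p in ordered if p not in dense]


def snp_distance(calls_a, calls_b, positions):
    """Count positions where both isolates have a base call and the calls differ.

    A position absent from either dict is treated as missing and skipped.
    """
    distance = 0
    for pos in positions:
        a = calls_a.get(pos, MISSING)
        b = calls_b.get(pos, MISSING)
        if MISSING in (a, b):
            continue
        if a != b:
            distance += 1
    return distance


def pairwise_distances(calls, positions):
    """Return {(isolate_a, isolate_b): distance} for every unordered pair."""
    return {
        (a, b): snp_distance(calls[a], calls[b], positions)
        for a, b in combinations(sorted(calls), 2)
    }


# Toy data: four SNPs crowded near position 5000 and one missing call at 9000.
calls = {
    "iso1": {100: "A", 5000: "C", 5010: "G", 5020: "T", 5030: "A", 9000: "G"},
    "iso2": {100: "T", 5000: "G", 5010: "A", 5020: "C", 5030: "T", 9000: "N"},
    "iso3": {100: "T", 5000: "C", 5010: "G", 5020: "T", 5030: "A", 9000: "A"},
}
kept = density_filter([100, 5000, 5010, 5020, 5030, 9000])
distances = pairwise_distances(calls, kept)
```

In this toy example the four SNPs clustered near position 5000 are dropped as a dense region, and the missing call in `iso2` at position 9000 is skipped rather than counted as a difference, so the `iso2`/`iso3` pair ends up with a distance of 0.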