CFSAN SNP Pipeline 2: a pipeline for fast and accurate SNP distance estimation from bacterial genome assemblies

Bibliographic Details
Published in PeerJ. Computer science Vol. 11; p. e2878
Main Authors Literman, Robert; Gangiredla, Jayanthi; Rand, Hugh; Pettengill, James B
Format Journal Article
Language English
Published PeerJ. Ltd 09.07.2025
PeerJ Inc
Summary: Accurate genetic distance estimation from pathogen whole-genome sequence data is critical for public health surveillance, and with respect to food safety it provides crucial information within traceback and outbreak investigations. The computational demands required for contemporary bioinformatics pipelines to extract high resolution single nucleotide polymorphisms (SNPs) grow in parallel with the size of pathogen clusters, where single strains of common pathogens such as Escherichia coli and Salmonella enterica can now contain hundreds or thousands of isolates. To facilitate rapid analysis of whole-genome sequencing (WGS) data for large clusters of foodborne bacterial pathogens, we introduce the CFSAN SNP Pipeline 2 (CSP2). CSP2 is a bioinformatics pipeline coded in Nextflow and Python that extracts SNPs directly from genome assemblies in seconds through rapid MUMmer whole-genome alignment and parallel processing. After genome alignment, most data processing steps mirror the quality control measures used in the CFSAN SNP Pipeline (CSP1), including density filtering and missing data handling. Analysis of simulated data finds that high quality assemblies from the strategic K-mer extension for scrupulous assemblies (SKESA) contain sufficient information for accurate, high resolution SNP distance estimation, while assemblies from the St. Petersburg genome assembler (SPAdes) contain more false positives. CSP2 SNP distances for 150 real-world clusters (50 each of E. coli, Listeria monocytogenes, and S. enterica) were highly correlated with those from CSP1 and the National Center for Biotechnology Information (NCBI) Pathogen Detection pipeline (E. coli r ≥ 0.98; Salmonella r = 0.99; Listeria r = 0.99). This evaluation of CSP2 demonstrates its comparability to accepted methods and validates its use within future traceback and outbreak investigations.
ISSN: 2376-5992
DOI: 10.7717/peerj-cs.2878
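
The summary names density filtering and missing data handling as quality control steps before SNP distances are computed. The sketch below illustrates the general idea of both steps in Python, one of the languages the summary says CSP2 is written in. It is not taken from CSP2 or CSP1: the function names `density_filter` and `snp_distance`, the `window` and `max_snps` parameters and their default values, and the use of `N` as the missing-data symbol are all assumptions made for illustration.

```python
def density_filter(positions, window=1000, max_snps=3):
    """Drop SNPs that sit in dense clusters (hypothetical thresholds).

    A SNP is removed when a stretch of `window` bases starting at some
    SNP holds more than `max_snps` SNPs and includes it.
    """
    pos = sorted(positions)
    dense = [False] * len(pos)
    right = 0  # exclusive end of the current window
    for left in range(len(pos)):
        # Extend the window to every SNP closer than `window` bases.
        while right < len(pos) and pos[right] - pos[left] < window:
            right += 1
        if right - left > max_snps:
            for k in range(left, right):
                dense[k] = True
    return [p for p, is_dense in zip(pos, dense) if not is_dense]


def snp_distance(a, b, missing="N"):
    """Count differing sites, skipping any site missing in either isolate."""
    return sum(
        1 for x, y in zip(a, b)
        if x != y and x != missing and y != missing
    )


if __name__ == "__main__":
    # Four SNPs within 30 bases exceed max_snps=3, so that cluster is dropped.
    print(density_filter([100, 5000, 5010, 5020, 5030, 9000]))
    # One true difference; the site with missing data is ignored.
    print(snp_distance("ACGTN", "ATGTA"))
```

Skipping sites with missing data, rather than counting them as differences, keeps low-coverage isolates from inflating pairwise distances. The trade-off is that distances between pairs with many missing sites rest on fewer comparable positions.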