CFSAN SNP Pipeline 2: a pipeline for fast and accurate SNP distance estimation from bacterial genome assemblies


Bibliographic Details
Published in	PeerJ. Computer science Vol. 11; p. e2878
Main Authors	Literman, Robert; Gangiredla, Jayanthi; Rand, Hugh; Pettengill, James B
Format	Journal Article
Language	English
Published	PeerJ. Ltd 09.07.2025 PeerJ Inc
Subjects	Analysis; Anopheles; Bacteria; Bacterial genetics; Biotechnology; Chromosomes; Escherichia coli; Food; Genomes; Genomics; Listeria; Multiprocessing; Pathogen; Safety and security measures; Single nucleotide polymorphisms; SNP distance

Summary:	Accurate genetic distance estimation from pathogen whole-genome sequence data is critical for public health surveillance, and in food safety it provides crucial information for traceback and outbreak investigations. The computational demands on contemporary bioinformatics pipelines that extract high-resolution single nucleotide polymorphisms (SNPs) grow in parallel with the size of pathogen clusters, where single strains of common pathogens such as Escherichia coli and Salmonella enterica can now contain hundreds or thousands of isolates. To facilitate rapid analysis of whole-genome sequencing (WGS) data for large clusters of foodborne bacterial pathogens, we introduce the CFSAN SNP Pipeline 2 (CSP2). CSP2 is a bioinformatics pipeline coded in Nextflow and Python that extracts SNPs directly from genome assemblies in seconds through rapid MUMmer whole-genome alignment and parallel processing. After genome alignment, most data processing steps mirror the quality control measures used in the CFSAN SNP Pipeline (CSP1), including density filtering and missing data handling. Analysis of simulated data finds that high-quality assemblies from the strategic K-mer extension for scrupulous assemblies (SKESA) assembler contain sufficient information for accurate, high-resolution SNP distance estimation, while assemblies from the St. Petersburg genome assembler (SPAdes) contain more false positives. CSP2 SNP distances for 150 real-world clusters (50 each of E. coli, Listeria monocytogenes, and S. enterica) were highly correlated with those from CSP1 and the National Center for Biotechnology Information (NCBI) Pathogen Detection pipeline (E. coli r ≥ 0.98; Salmonella r = 0.99; Listeria r = 0.99). This evaluation of CSP2 demonstrates its comparability to accepted methods and validates its use within future traceback and outbreak investigations.
ISSN:	2376-5992
DOI:	10.7717/peerj-cs.2878
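
Illustrative sketch:	The summary mentions density filtering, missing data handling, and pairwise SNP distances without giving details. The Python below is a minimal sketch of those three ideas and is not the CSP2 implementation. The function names, the 1000 bp window, the limit of 3 SNPs per window, and the use of "N" for a missing base are all assumptions made for illustration; the record does not state the values CSP2 uses.

```python
from bisect import bisect_left
from itertools import combinations


def density_filter(positions, window=1000, max_snps=3):
    """Drop SNPs lying in any window of `window` bp that holds more than
    `max_snps` SNPs. Both thresholds are hypothetical, not CSP2 defaults."""
    pos = sorted(positions)
    dense = set()
    for i, p in enumerate(pos):
        # Index of the first SNP outside the half-open window [p, p + window)
        j = bisect_left(pos, p + window)
        if j - i > max_snps:
            dense.update(pos[i:j])
    return [p for p in pos if p not in dense]


def snp_distance(a, b, missing="N"):
    """Count sites where both isolates have a called base and the bases differ.
    Sites with a missing call in either isolate are skipped (assumed convention)."""
    return sum(1 for x, y in zip(a, b) if x != y and missing not in (x, y))


def pairwise_distances(seqs):
    """Return {(name_i, name_j): distance} for every pair of isolates."""
    return {
        (i, j): snp_distance(seqs[i], seqs[j])
        for i, j in combinations(sorted(seqs), 2)
    }


# Toy data: four SNPs clustered within 1000 bp are removed, the isolated one is kept.
kept = density_filter([100, 150, 200, 250, 5000])

# Toy alignment of SNP columns for three hypothetical isolates.
distances = pairwise_distances(
    {"iso1": "ACGTN", "iso2": "ACCTA", "iso3": "TCCTA"}
)
```

Skipping a site when either isolate lacks a call keeps missing data from inflating distances; a real pipeline would also need to decide how many missing sites an isolate may have before it is excluded.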